Posts Tagged ‘servers’

Build a Central Linux Logging Server to Collect, Store, and Visualize All Infrastructure node Logs

Friday, March 20th, 2026

If you manage multiple servers, or many services spread across the nodes of a company infrastructure, you know the pain of dealing with logs scattered across systems. Chasing them wastes time and drains energy.
One server shows nothing, another rotated its logs yesterday, and your app logs are buried somewhere in /var/log/app.

A central logging server solves this problem: all logs are collected, stored, and accessible in one single place.

This article briefly shows how to build one using the ELK Stack + Beats (lightweight agents) on a Linux server.

1. Architecture Overview

The typical flow looks like this:

[ Servers / Apps ] –> [ Filebeat / Metricbeat ] –> [ Logstash ] –> [ Elasticsearch ] –> [ Kibana / Grafana (Visualization) ]

  • Beats → Lightweight log shippers installed on all machines.
  • Logstash → Optional pipeline for parsing, filtering, and enriching logs.
  • Elasticsearch → Storage and search engine.
  • Kibana / Grafana → Visualization dashboards.

2. Prepare Your Central Logging Server

Requirements:

  • Debian 12 (recommended), Ubuntu, or a Fedora / RHEL-family distribution (the commands below are for Debian-based systems)
  • At least 4 GB RAM (8+ GB for production ELK)
  • Plan enough SSD storage (logs grow fast)
  • Open ports: 5044 for Beats, 9200 for Elasticsearch, 5601 for Kibana

Install Prerequisites

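# apt update && apt install openjdk-17-jdk wget curl apt-transport-https -y

Elasticsearch and Logstash 8.x ship with their own bundled JDK, so a system-wide Java is optional; if you prefer to install one, OpenJDK 17 should work fine.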

3. Install Elasticsearch

# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.11.1-amd64.deb
# dpkg -i elasticsearch-8.11.1-amd64.deb
# systemctl enable elasticsearch
# systemctl start elasticsearch


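Check that the Elasticsearch server is running:

# curl -X GET "localhost:9200/"

You should see the cluster info in JSON format. With the 8.x packages, security auto-configuration is on by default, so plain HTTP is refused; in that case use HTTPS with the elastic user's password printed during installation:

# curl --cacert /etc/elasticsearch/certs/http_ca.crt -u elastic https://localhost:9200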

4. Install Kibana

# wget https://artifacts.elastic.co/downloads/kibana/kibana-8.11.1-amd64.deb
# dpkg -i kibana-8.11.1-amd64.deb
# systemctl enable kibana
# systemctl start kibana


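Kibana listens only on localhost by default; to reach it from another machine, set server.host in /etc/kibana/kibana.yml and restart Kibana. Then open the Kibana URL in a browser: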

http://<server-ip>:5601

5. Install Logstash to Process Logs before Sending to Elasticsearch

# wget https://artifacts.elastic.co/downloads/logstash/logstash-8.11.1.deb
# dpkg -i logstash-8.11.1.deb
# systemctl enable logstash
# systemctl start logstash

Logstash allows filtering and structuring logs before sending them to Elasticsearch. Example simple pipeline:

# vim /etc/logstash/conf.d/syslog.conf

input {
  beats {
    port => 5044
  }
}
filter {
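  grok {
    # Capture into syslog_host rather than host, which Beats already uses for its own host metadata
    match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:syslog_host} %{DATA:program}: %{GREEDYDATA:message}" }
    # Replace the original message with the captured text instead of appending to it
    overwrite => [ "message" ]
  }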
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "central-logs-%{+YYYY.MM.dd}"
  }
}
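To get a feel for what the grok filter pulls out of a line, the same split can be roughly approximated in plain shell. This is only an illustrative sketch: the function name parse_syslog_line, the output labels and the sample log line are made up for the example, and the sed expression is a simplification of the real SYSLOGTIMESTAMP and SYSLOGHOST patterns.

```shell
#!/bin/bash
# parse_syslog_line: read classic syslog lines on stdin and print the four
# parts the grok pattern captures (timestamp, hostname, program, message text).
# Hypothetical helper for illustration only; relies on GNU sed (\n in replacement).
parse_syslog_line() {
  sed -E 's/^([A-Z][a-z]{2} +[0-9]{1,2} [0-9:]{8}) ([^ ]+) ([^:]+): (.*)$/timestamp=\1\nhostname=\2\nprogram=\3\ntext=\4/'
}

# Sample line, invented for the example
echo 'Mar 20 10:15:01 web01 sshd[1234]: Failed password for root' | parse_syslog_line
```

Lines that do not match the expected layout pass through unchanged, which is a handy way to spot log formats the pattern would not parse.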

Restart Logstash to load the pipeline

# systemctl restart logstash

6. Install Beats on Client Machines

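On each server you want to monitor (the Beats packages come from the Elastic APT repository, which has to be added first; they are not in the stock Debian archive):

# apt install filebeat metricbeat -y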


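Configure Filebeat

Edit the config:

# vim /etc/filebeat/filebeat.yml

Set the output to your central server (and comment out the default output.elasticsearch section, since Filebeat allows only one output):

output.logstash:
  hosts: ["<server-ip>:5044"]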

Start the agent:

# systemctl enable filebeat
# systemctl start filebeat

Do the same for Metricbeat if you want metrics like CPU, memory, disk.
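If nothing shows up on the central server, it is worth checking first that the client can reach the Beats port at all. A minimal sketch using only bash's built-in /dev/tcp redirection follows; the helper name port_open and the 3-second timeout are arbitrary choices for the example.

```shell
#!/bin/bash
# port_open HOST PORT: succeeds if a TCP connection can be opened within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no netcat or telnet is required.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (replace <server-ip> with the address of your logging server):
#   port_open <server-ip> 5044 && echo "Logstash Beats port reachable"
```

Filebeat also ships its own checks, filebeat test config and filebeat test output, which go one step further and validate the configured output.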

7. Create Dashboards in Kibana or Grafana

  • In Kibana, use Discover to view logs.
  • Create visualizations for errors, warnings, top endpoints, etc.
  • Use Grafana if you want multi-source dashboards, combining logs and metrics.

8. Optional: Secure Your Logging Server

  • Enable TLS/SSL in Beats and Elasticsearch.
  • Use firewall rules to restrict access.
  • Create dedicated users in Elasticsearch for log access.

9. Maintenance Tips

  • Index Lifecycle Management → Rotate daily and delete old logs automatically.
  • Monitor disk usage → Logs grow fast. SSDs are better.
  • Filter noise → Don’t ship debug logs unless needed.
  • Backup Elasticsearch → Especially if logs are critical.
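Index Lifecycle Management is the proper tool for retention, but to see which daily indices would be affected, a small filter like the one below can help. It is a sketch that assumes the central-logs-YYYY.MM.dd naming from the Logstash output above; the function name old_indices is made up for the example.

```shell
#!/bin/bash
# old_indices CUTOFF: read index names on stdin and print those named
# central-logs-YYYY.MM.dd whose date is older than CUTOFF (also YYYY.MM.dd).
# Dates in this format sort correctly as plain strings.
old_indices() {
  awk -v cutoff="$1" -F'central-logs-' '
    /^central-logs-[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/ &&
    ($2 "") < (cutoff "") { print $0 }'
}

# Example: list indices older than 30 days (add -u/--cacert if security is enabled):
#   curl -s 'localhost:9200/_cat/indices/central-logs-*?h=index' \
#     | old_indices "$(date -d '30 days ago' +%Y.%m.%d)"
```

The function only prints names; deleting anything is left as a separate, deliberate step.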

Summing Up: How It Works

  • All logs are centralized → easier troubleshooting.
  • Scalable → add new servers, Beats handle shipping automatically.
  • Searchable → find errors instantly using Elasticsearch.
  • Visual → dashboards in Kibana/Grafana give real-time insight.

How to Harden a Linux Server in 2025 – Practical Steps for Sysadmins to protect against hackers and bots

Thursday, December 11th, 2025


Securing a Linux server has never been more important than it is these days.
With automated attacks, AI-driven exploits, and increasingly complex infrastructure, even a small misconfiguration can lead to a serious breach.
But you don't have to wait to get hit by a random script kiddie. The good news is that a few practical and fairly standard steps can drastically increase your server’s security.

Below is a straightforward, battle-tested hardening guide suitable for Debian, Ubuntu, CentOS, AlmaLinux, and most modern distributions.

1. Keep the System Updated (But Safely)

Outdated packages remain the #1 cause of server compromises.
On Debian/Ubuntu:

# apt update && apt upgrade -y

# apt install unattended-upgrades

On RHEL-based systems:
 

# dnf update -y

# dnf install dnf-automatic

Enable security-only auto-updates where possible. Full auto-updates may break production apps, so use them carefully.

2. Create a Non-Root User and Disable Direct Root Login
 

Attackers constantly brute-force “root”. Avoid letting them.
 

# adduser sysadmin

# usermod -aG sudo sysadmin

Then edit SSH:

# vim /etc/ssh/sshd_config

Set:

PermitRootLogin no

PasswordAuthentication no

And restart:

# systemctl restart sshd


Use SSH keys only, and confirm that key-based login works in a second session before you close the current one.
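To double-check what a config file actually sets, a tiny helper can print the first active value of a keyword. This is a simplified sketch (the name sshd_value is made up): it ignores Include files and Match blocks, so on a real system the authoritative answer comes from sshd -T run as root.

```shell
#!/bin/bash
# sshd_value FILE KEYWORD: print the first uncommented value of KEYWORD
# in an sshd_config-style file (sshd itself uses the first value it finds).
# Keyword matching is case-insensitive, as in sshd_config.
sshd_value() {
  awk -v key="$2" 'tolower($1) == tolower(key) { print $2; exit }' "$1"
}

# Example:
#   sshd_value /etc/ssh/sshd_config PermitRootLogin
#   sshd_value /etc/ssh/sshd_config PasswordAuthentication
```

Running sshd -t before restarting the service catches syntax errors that could otherwise lock you out.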

3. Install a Firewall and Block Everything by Default

UFW (Debian/Ubuntu):

# ufw default deny incoming

# ufw default allow outgoing

# ufw allow ssh

# ufw enable

Firewalld (RHEL/AlmaLinux):

# systemctl enable firewalld --now

# firewall-cmd --permanent --add-service=ssh

# firewall-cmd --reload

Turn off any unneeded ports immediately.

4. Protect SSH with Fail2Ban

Fail2Ban watches log files for suspicious authentication attempts and blocks offenders.

# apt install fail2ban -y

or

# dnf install fail2ban -y

Enable:

# systemctl enable --now fail2ban

To harden the SSH jail, add to /etc/fail2ban/jail.local:

[sshd]
enabled = true
maxretry = 5
bantime = 1h
findtime = 10m

5. Enable Kernel Hardening

Install sysctl rules that protect against common attacks:

Create /etc/sysctl.d/99-hardening.conf:

kernel.kptr_restrict = 2

kernel.sysrq = 0

net.ipv4.conf.all.rp_filter = 1

net.ipv4.tcp_synack_retries = 2

net.ipv4.conf.all.accept_redirects = 0

net.ipv4.conf.all.send_redirects = 0

net.ipv4.conf.all.log_martians = 1

Apply:

# sysctl --system
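After applying, it can be useful to verify that the running kernel really has the values from the file. The sketch below compares each "key = value" line with what /proc/sys reports; the function names sysctl_path and check_sysctl_file are invented for this example, and it assumes keys without dots inside interface names.

```shell
#!/bin/bash
# sysctl_path KEY: map a sysctl key to its /proc/sys path,
# e.g. kernel.sysrq -> /proc/sys/kernel/sysrq
sysctl_path() {
  echo "/proc/sys/${1//.//}"
}

# check_sysctl_file FILE: report every key whose live value differs
# from the value written in FILE (comments and blank lines are skipped).
check_sysctl_file() {
  local key value live
  while IFS='=' read -r key value; do
    key="${key//[[:space:]]/}"
    value="${value//[[:space:]]/}"
    [ -z "$key" ] && continue
    case "$key" in \#*) continue ;; esac
    live="$(cat "$(sysctl_path "$key")" 2>/dev/null)"
    [ "$live" = "$value" ] || echo "MISMATCH $key: want $value, have ${live:-unreadable}"
  done < "$1"
}

# Example:
#   check_sysctl_file /etc/sysctl.d/99-hardening.conf
```

No output means every listed setting is active; some keys are only readable as root.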

6. Install and Configure AppArmor or SELinux

Mandatory Access Control significantly limits damage if a service gets compromised.

  • Ubuntu / Debian uses AppArmor by default — ensure it's enabled.
  • RHEL, AlmaLinux, Rocky use SELinux — keep it in enforcing mode unless absolutely necessary.

Check SELinux:

# getenforce

You want:

Enforcing. Expect to spend some time configuring your services so that they work correctly with SELinux enabled.

7. Scan the System with Lynis

Lynis is a widely used open-source Linux security auditing tool.

# apt install lynis

# lynis audit system

It provides a security score and actionable suggestions.

8. Use 2FA for SSH (Optional but Highly Recommended)

Use Two Factor Authentication:

a. For free with the OATH Toolkit; my previous article on how to set up 2FA free software authentication on Linux explains how

or

b. Install Google Authenticator:

# apt install libpam-google-authenticator

# google-authenticator

Enable in /etc/pam.d/sshd:

auth required pam_google_authenticator.so

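And in SSH config (on OpenSSH 8.7 and later the option is named KbdInteractiveAuthentication, and ChallengeResponseAuthentication is a deprecated alias):

ChallengeResponseAuthentication yes

To require both the SSH key and the one-time code, also set AuthenticationMethods publickey,keyboard-interactive. Then restart SSH.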

9. Separate Services Using Containers or Systemd Isolation

Even simple servers can benefit from isolation.

Systemd sandbox options:

ProtectSystem=full

ProtectHome=true

ProtectKernelTunables=true

PrivateTmp=true

Add these to the [Service] section of the unit file:

/etc/systemd/system/yourservice.service

This prevents the service's processes from touching parts of the system they shouldn’t.

10. Regular Backups Are Part of Security

A secure server with no backups is a disaster waiting to happen.

Use:

  • rsync
  • borgbackup
  • restic
  • Cloud object storage with versioning

Always encrypt backups and test restore procedures.

Conclusion

Hardening a Linux server in 2025 requires vigilance, good practices, and layered security. No single tool will protect your system — but when you combine SSH security, firewalls, Fail2Ban, kernel hardening, and backups, you eliminate the majority of attack vectors.

 

How to keep your Linux server Healthy for Years: Hard learned lessons

Friday, November 28th, 2025


I’ve been running Linux servers long enough to watch hardware die, kernels panic, filesystems fill up at midnight hours, and network cards slowly burn out like old light bulbs.

Over time, you learn that keeping a server alive is less about “perfect architecture” and more about steady discipline: the small habits you build for managing the machines help prevent big disasters.

Here are some practical, battle-tested lessons that keep my boxes running for years with minimal downtime. Most of them were learned the hard way.

1. Monitor Before You Fix – and Fix Before It Breaks

Most Linux disasters come from things we should have noticed earlier, in other words from a lack of monitoring. There is a modern-day saying that should become your favourite if you are a sysadmin or DevOps engineer:

"Monitor everything!"

  • The disk that was at 89% yesterday will be at 100% tonight.
  • The log file that grew by 500 MB last week will explode this week.
  • The swap usage creeping from 1% → 5% → 20% means your next heavy task will choke.
  • The BIOS CMOS battery failing unnoticed.
  • The RAID array slowly degrading.

You don’t need enterprise monitoring to prevent this. Even simple tools like monit, a plain zabbix-agent -> zabbix-server setup, or any other basic scripted monitoring give you an early warning of trouble.

Even a short shell script run from cron can save you hours of further sh!t:

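#!/bin/bash

# Mail a warning only when / is more than 85% full (no mail is sent otherwise)
usage=$(df -P / | awk 'NR==2 { print $5+0 }')
if [ "$usage" -gt 85 ]; then
  echo "Disk Alert: / is at ${usage}%" | mail -s "Disk Warning on $(hostname)" admin@example.com
fi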

2. Treat /etc directory as Sacred – Treat It Like an expensive gem

Every sysadmin eventually faces the nightmare of a broken config overwritten by a package update or a hasty command at 2 AM.

To avoid crying later, archive /etc automatically:

# tar czf /root/etc-$(date +%Y-%m-%d).tar.gz /etc


If you prefer a more sophisticated backup, you can use etc_backup.sh, my clone of dirs_backup.sh (an old script I wrote to simplify backing up specific directories on the filesystem), which you can get here.
Run it weekly via cron.
This little trick has saved me more times than I can count — especially when migrating between Debian releases or recovering from accidental edits.
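Weekly archives pile up over the years, so some pruning is useful. Here is a minimal sketch that keeps only the newest few archives; it assumes the etc-YYYY-MM-DD.tar.gz naming from the tar command above and a directory path without spaces, and the function name prune_etc_backups is just an example.

```shell
#!/bin/bash
# prune_etc_backups DIR KEEP: delete all but the KEEP newest etc-*.tar.gz
# archives in DIR. The date in the file name sorts correctly as text,
# so a reverse sort puts the newest archive first.
prune_etc_backups() {
  local dir="$1" keep="$2"
  ls -1 "$dir"/etc-*.tar.gz 2>/dev/null \
    | sort -r \
    | tail -n +"$((keep + 1))" \
    | xargs -r rm -f --
}

# Example: keep the last 8 weekly archives
#   prune_etc_backups /root 8
```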

3. Automate Everything You Have to Do Repeatedly

If you find yourself doing something manually more than twice, script it and forget it.

Examples:

  • rotating logs for misbehaving apps
  • restarting services that occasionally get “stuck”
  • syncing backups between machines
  • cleaning temp directories

Here’s a small example I still use today:

#!/bin/bash

# Kill PHP-FPM processes whose resident memory (RSS, column 6, in KiB) exceeds ~300 MB.
# The [p] in the pattern keeps grep from matching its own process.

ps aux | grep '[p]hp-fpm' | awk '$6 > 300000 { print $2 }' | xargs -r kill -9

A dirty way to get rid of misbehaving php-fpm?
Yes. But it works.

4. Backups Don’t Exist Unless You Test Them

It’s easy to feel proud when you write a backup script.
It’s harder – and far more important – to test the restore.

Once a month, or at least once every few months, try restoring a random backup to a dummy VM.
Sometimes a backup fails silently, or you get something different from what you expected. Finding that out during a test
means you will not be crying helplessly later.

A broken backup doesn’t fail quietly – it fails on the day you need it most.
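A restore test does not have to be elaborate. As a rough sketch, the function below unpacks an archive into a scratch directory and compares it with the live tree; the name verify_restore is made up, it assumes the archive contains the directory under its own base name (as tar czf ... /etc produces), and it compares file contents only, not ownership or permissions.

```shell
#!/bin/bash
# verify_restore ARCHIVE SOURCE_DIR: extract ARCHIVE into a temporary
# directory and diff the result against SOURCE_DIR.
# Returns 0 if contents match, non-zero on any difference or extraction error.
verify_restore() {
  local archive="$1" source="$2" scratch rc
  scratch="$(mktemp -d)" || return 1
  tar xzf "$archive" -C "$scratch" 2>/dev/null \
    && diff -r "$source" "$scratch/$(basename "$source")"
  rc=$?
  rm -rf "$scratch"
  return "$rc"
}

# Example (as root, right after taking the backup):
#   verify_restore /root/etc-2025-11-28.tar.gz /etc && echo "backup restores cleanly"
```

On a live system some differences are expected if files changed after the backup was taken, so read the diff output rather than trusting the exit code blindly.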

5. Don’t Ignore Hardware – It Ages like Everything Else

Linux might run forever, but hardware doesn’t.

Signs of impending doom:

  • dmesg spam with I/O errors
  • slow SSD response
  • increasing SMART reallocated sectors
  • random freezes without logs
  • sudden network flakiness

Run this monthly:
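One possible monthly check, sketched under the assumption that smartmontools is installed and the disk is an ATA/SATA drive at /dev/sda (both are assumptions, as is the helper name reallocated_sectors), is to pull the reallocated sector count out of the SMART attribute table:

```shell
#!/bin/bash
# reallocated_sectors: read `smartctl -A` output on stdin and print the raw
# value (10th column) of the Reallocated_Sector_Ct attribute.
# NVMe drives report health differently and will print nothing here.
reallocated_sectors() {
  awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Example (as root):
#   smartctl -A /dev/sda | reallocated_sectors
#   dmesg --level=err,warn | tail -n 20
```

A value that keeps growing from month to month is the signal to replace the disk before it fails.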

6. Document Everything (Future You Will Thank Past You)

There are moments when you ask yourself:

“Why did I configure this machine like this?”

If you don’t document your decisions, you’ll have no idea one year later.

A simple markdown file inside /root/notes.txt or /root/README.md is enough.

Document:

  • installed software
  • custom scripts
  • non-standard configs
  • firewall rules
  • weird hacks you probably forgot already

This turns chaos into something you can actually maintain.

7. Keep Things Simple – Complexity Is the Enemy of Uptime

The longer I work with servers, the more I strip away:

  • fewer moving parts
  • fewer services
  • fewer custom patches
  • fewer “temporary” hacks that become permanent

A simple system is a reliable system.
A complex one dies at the worst possible moment.

8. Accept That Failure Will Still Happen

No matter how careful you are, servers will surely:

  • crash
  • corrupt filesystems
  • lose network connectivity
  • inexplicably freeze
  • reboot after a kernel panic

Don’t aim for perfection. Aim for resilience.

If you can restore the machine in under an hour, you're winning.

Final Thoughts

Linux is powerful – but it rewards those who treat it with respect and perseverance.
Over many years, I’ve realized that maintaining servers is less about brilliance and more about humble, consistent care and persistent hard work.

I hope this article helps some sysadmin to rethink and reorganize their server maintenance strategy in a way that avoids a server meltdown at 3 AM.

Cheers ! 

 

Optimizing Linux Server Performance Through Digital Minimalism and Running Services and System Cleanup

Friday, October 3rd, 2025


In today’s landscape of bloated software stacks, automated dependency chains, and background services that consume memory and CPU without notice, Linux system administrators and enthusiasts alike benefit greatly from applying digital minimalism to what is set up on the server and reducing it to the absolute minimum.

Digital minimalism in the context of Linux servers means removing what you don't need, disabling what you don't use, and optimizing what remains — all with the goal of increasing performance, improving security, and simplifying further maintenance.
In this article, we’ll walk through practical steps to declutter your Linux server, optimize resources, and regain control over what’s running and why.

1. Identify and Remove Unnecessary Packages

Over time, many systems accumulate unused packages — either from experiments, dependency installations, or unnecessary defaults.

On Debian/Ubuntu

Find orphaned packages:
 

# apt autoremove --dry-run


Remove unnecessary packages:
 

# apt autoremove
# apt purge <package-name>


List large installed packages:

# dpkg-query -Wf '${Installed-Size}\t${Package}\n' | sort -n | tail -n 20


On RHEL/CentOS/AlmaLinux:

Find and remove orphaned packages (dnf asks for confirmation before removing anything):

# dnf autoremove

List packages sorted by size:

# rpm -qa --qf '%{SIZE}\t%{NAME}\n' | sort -n | tail -n 20


2. Audit and Disable Unused Services
 

Every running service consumes memory, CPU cycles, and opens potential attack surfaces.

List enabled services:
 

# systemctl list-unit-files --type=service --state=enabled

See currently running services:

# systemctl --type=service --state=running

Take the time to review the list and disable everything unnecessary.

 

Disable unneeded services :

# systemctl disable --now bluetooth.service
# systemctl disable --now cups.service
# systemctl disable --now ModemManager.service

And so on

Useful services to disable (if unused):
 

Service             Purpose              When to Disable
cups.service        Printer daemon       On headless servers
bluetooth.service   Bluetooth stack      On servers without Bluetooth
avahi-daemon        mDNS/Zeroconf        Not needed on most servers
ModemManager        Modem management     If not using 3G/4G cards
NetworkManager      Dynamic net config   Prefer systemd-networkd for static setups


Simple Shell Script to List & Review Services
 

#!/bin/bash
echo "Enabled services:"
systemctl list-unit-files --state=enabled | grep service
echo ""
echo "Running services:"
systemctl --type=service --state=running

3. Optimize Startup and Boot Time

Analyze system boot performance:

# systemd-analyze

View which services take the longest:

# systemd-analyze blame
min 25.852s certbot.service
5min 20.466s logrotate.service
1min 29.748s plocate-updatedb.service
54.595s php5.6-fpm.service
43.445s systemd-logind.service
42.837s e2scrub_reap.service
37.915s apt-daily.service
35.604s mariadb.service
31.509s man-db.service
27.405s systemd-journal-flush.service
18.357s ifupdown-pre.service
14.672s dev-xvda2.device
13.523s rc-local.service
11.024s dpkg-db-backup.service
9.871s systemd-sysusers.service
...

 

Disable or mask long-running services that are not essential.


Why is masking services important?


Because a later package update can re-enable a service you only disabled, so an unwanted daemon starts again at boot. A masked unit is linked to /dev/null and cannot be started until it is unmasked.

Example:
 

# systemctl mask lvm2-monitor.service


4. Reduce Memory Usage (Especially on Low-RAM VPS)
 

Monitor memory usage:

# free -h
# top
# htop

Use lightweight alternatives:

Service        Heavy         Lightweight Alternative
Web server     Apache        Nginx / Caddy / Lighttpd
Database       MySQL         MariaDB / SQLite (if local)
Syslog         rsyslog       busybox syslog / systemd journal
Shell          bash          dash / ash
File manager   GNOME Files   mc / ranger (CLI)


5. Configure Swap (Only If Needed)
 

Having too much or too little swap can affect performance.


Check if swap is active:

# swapon --show


Create swap file (if needed):

# fallocate -l 1G /swapfile
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile

Add to /etc/fstab for persistence:

/swapfile none swap sw 0 0

6. Clean Up Cron Jobs and Timers

Old scheduled tasks can silently run in the background and consume resources.

List user cron jobs:

crontab -l

Check system-wide cron jobs:

# ls /etc/cron.*
# ls -al /var/spool/cron/*


List systemd timers:

# systemctl list-timers


Disable any unneeded timers or outdated cron entries.

7. Optimize Logging and Log Rotation

Logs are essential but can grow large and fill up disk space quickly.

Check log size:

# du -sh /var/log/*
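With many entries under /var/log the plain du listing is hard to scan, so sorting by size helps. A small sketch follows; the function name largest_entries and its defaults are arbitrary choices for the example.

```shell
#!/bin/bash
# largest_entries [DIR] [N]: print the N largest entries directly under DIR,
# biggest last, with sizes in KiB. Defaults: /var/log and 10 entries.
largest_entries() {
  local dir="${1:-/var/log}" n="${2:-10}"
  du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$n"
}

# Example (as root, so every subdirectory is readable):
#   largest_entries /var/log 5
```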

Force logrotate:
 

# logrotate -f /etc/logrotate.conf

Edit /etc/logrotate.conf or specific files in /etc/logrotate.d/* to reduce retention if needed.

8. Check for Zombie Processes and Old Users

Forgotten user accounts and zombie processes can indicate neglected cleanup, or that the server has been hacked (cracked).

List users:
 

cat /etc/passwd | cut -d: -f1
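The full list is mostly system accounts, so it helps to narrow it down to accounts that can actually log in. This is a sketch assuming the usual Debian/RHEL convention that regular users start at UID 1000; the function name login_users is made up for the example.

```shell
#!/bin/bash
# login_users [FILE]: print accounts with UID >= 1000 whose shell is not
# nologin or false, read from a passwd-format file (default /etc/passwd).
login_users() {
  awk -F: '$3 >= 1000 && $7 !~ /(nologin|false)$/ { print $1 }' "${1:-/etc/passwd}"
}

# Example:
#   login_users
```

Any name in the output that you do not recognize deserves a closer look before you remove it.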

Remove unused accounts:
 

# userdel -r username


Check for zombie processes:
 

# ps aux | awk '$8 ~ /^Z/ { print $0 }'

9. Disable IPv6 (if not used)

IPv6 can add unnecessary complexity and attack surface if you’re not using it.

To disable IPv6 temporarily:

# sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.default.disable_ipv6=1

To disable permanently, add to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1


10. Final Thoughts: Less Is More

Digital minimalism is not just a personal tech trend — it's a philosophy of clarity, performance, and security. Every running process is a potential vulnerability. Every megabyte of RAM consumed by a useless service is wasted capacity. Every package installed increases the system’s complexity.

By regularly auditing, pruning, and simplifying your Linux server, you not only improve its performance and reliability, but you also reduce future maintenance headaches.

Minimalism = Maintainability.

Unlocking the Power of lnav: Logfile Navigator – ncurses text based tool guide to easy analysis of multiple Logs on multiple servers on Linux

Saturday, September 13th, 2025


If you've ever found yourself buried under a mountain of log files, tailing multiple outputs, or grepping through endless lines trying to spot an error, it's time to meet your new best friend: lnav, the Logfile Navigator.

Lightweight, terminal-based, and surprisingly powerful, lnav is one of the most underrated tools for developers, sysadmins, and anyone who regularly digs into logs. It turns your chaotic logs into something that’s not only readable—but genuinely useful.

What is lnav and why use it ?

lnav (Logfile Navigator) is a command-line tool for viewing and analyzing log files. It goes beyond tail, less, or grep by:

  • Automatically detecting and merging log formats.
  • Highlighting timestamps, log levels, and errors.
  • Providing SQL-like queries over your logs.
  • Offering interactive navigation with a UI inside the terminal.

And yes, all of that without needing to set up a database or a server.

1. Installing lnav on Linux

Installation is straightforward. On most systems, you can install it via package managers:

On Ubuntu/Debian:

# apt install lnav

On Fedora:

# dnf install lnav

On Arch Linux:

# pacman -S lnav

Or build from source via GitHub if you want the latest version.

2. Why Use lnav Instead of Tail / Grep?

Traditional tools are powerful, but they require manual work to chain together functionality. lnav gives you:

  • Automatic multi-log parsing: Drop multiple logs in, and it merges them chronologically.
  • Syntax highlighting: Errors and warnings stand out.
  • SQL querying: Run queries like SELECT * FROM syslog_log WHERE log_level = 'error';
  • Filtering and searching: Use intuitive filters and bookmarks to highlight specific entries.

3. Basic tool Usage is simple

Let’s say you want to inspect a system log:

# lnav /var/log/syslog

You'll immediately get:

  • Color-coded output (timestamps, levels, messages).
  • Scrollable view (arrow keys, PgUp, PgDn).
  • Real-time updates (like tail -f).
  • Search with /, filter with :filter-in, and even SQL queries.

Let's say you need to analyze Apache webserver logs recursively, including the rotated logs already compressed with a *.gz extension, on CentOS / Fedora / RHEL. You can do it with:

# lnav -r /var/log/httpd

You can parse a log line to get additional information about the request, and you can also print an overall summary of the log file.

Choose the line you want to parse. The selected line is always the one at the top of the window. Then press 'p' and you should see the following result:

https://pc-freak.net/images/lnav-get-extra-information-about-apache-query-with-P-press-key-screenshot-linux

Now, if you want to see a summary view of the logs by date and time, simply press 'i'.


To quit a screen you have chosen press 'q'.

4. LNAV helpful options and hotkeys

Once you've opened one or more log files for analysis, a few hotkeys let you move through the lnav output and the available views more easily:

e or E to jump to the next / previous error message.
w or W to jump to the next / previous warning message.
b or Backspace to move to the previous page.
Space to move to the next page.
g or G to move to the top / bottom of the current view.

To take a closer look at the way lnav operates, use the -d option, which writes debug information to a text file:

# lnav /var/log/httpd -d lnav.txt

In this example, the debug information that is generated when lnav starts will be written to a file named lnav.txt inside the current working directory.

5. Real-World Use Cases

a. Troubleshooting application or system process Crashes

Open all relevant logs in one go:

# lnav /var/log/*.log

Errors are highlighted, and you can jump between them with the e / E keys.

b. Combining Multiple Logs

Working with an app that logs to different files and you need to combine:

# lnav /var/log/nginx/access.log /var/log/nginx/error.log


Or let's say you want to combine Apache webserver and HAProxy logs and get log summaries or filter out stuff:

# lnav /var/log/apache2/access.log /var/log/haproxy.log


Now you will get a single, chronological timeline of events.

 

If you want to search for a specific occurrence of an error, warning or IP address inside a bunch of combined logs, you can do it the same way as in vim: press / (slash) and type what you want to find.

c. Analyze Logs with SQL Queries

Yes, you can actually do this by pressing ; to open lnav's SQL prompt and typing a query:

;.schema
;SELECT log_time, log_level, log_body FROM syslog_log WHERE log_level = 'error';

You get a table of filtered logs, sortable by columns.
 

6. lnav more usage command tips

  • :help - Opens the help menu.
  • :filter-in <string> - Show only lines matching <string>.
  • :filter-out <string> - Hide lines matching <string>.
  • :write-view-to <filename> - Write the text of the current view to a file.
  • :tag <tagname> - Tag lines for later reference.
  • q - Quit (but why would you want to?).

 

7. Using lnav as a pager for systemd-journald

# journalctl | lnav
# journalctl -f | lnav
# journalctl -u ssh.service | lnav

https://pc-freak.net/images/lnav_sshservice-log-view-screenshot-linux
 

8. Use lnav to review remote ssh logs

Versions newer than 0.10 also support the ssh protocol, so in theory this should work:

# lnav user@server-name-here:/var/log/file.log


To read all logs inside /var/log

# lnav root@server-name-here:/var/log/
# lnav root@server-name-here:/var/log/*.err

9. Using lnav to view docker container logs

# docker logs 811ab84aa95l | lnav
# docker logs -f application | lnav

The latest version of lnav supports even the following  simplified docker:// URL syntax:

# lnav docker://{container_id_or_name}/dir_path/to/log/file
# lnav docker://{container_id_or_name}/var/dir_path/log
# lnav docker://application/var/log/
# lnav docker://application/var/log/nginx/nginx.app.log

10. Monitoring compilation and command output useful for developers
 

Compiling from source tarballs with ./configure && make etc. generates a lot of output and logs along the way.
This is where the tool comes in handy.
For example, here is how to watch the output of the make command when compiling something:

# lnav -e './configure && make'

11. Learning lnav tool through online ssh service availability via lnav.org

If you're too lazy to install it and want to test it anyway:
 

# Start The Basic Tutorial:
ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password tutorial1@demo.lnav.org


# Playground:
ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password playground@demo.lnav.org


Closure

While tools like Kibana, Grafana, and ELK stacks are powerful, they can be overkill for many use cases—especially when you're SSHed into a box and just need to get answers fast. That’s where lnav shines as it is fast, lightweight, visual and can be used offline.

If you’re a developer, sysadmin, SRE (Site Reliability Engineer), or just someone who cares about logs, give lnav a spin. It might just become one of your favorite sysadmin tools on Linux and save you a lot of time if you have to read and analyze logs on a daily basis (for example, if you're administering 20 or more Linux servers).

 

Enable Debian Linux automatic updates to keep latest OS Patches / Security Up to Date

Monday, January 13th, 2025


I'm not a big fan of automation on GNU / Linux, as automatic updates can badly mess things up, especially with an OS as complex and somewhat chaotic as Linux is nowadays.
Nevertheless, as security is becoming more and more of a problem, especially browser security, having a scheduled way to apply updates, as every modern Windows and Mac OS offers, is becoming essential for a fully manageable operating system.

I use Debian GNU / Linux on my own personal desktop computer, and I already have a lot of Debian servers whose minor OS release and package version maintenance takes up too big a chunk of my time (time I could dedicate to more useful activities). So in some cases I found it worthwhile to turn on Debian's way of keeping the OS and its security up to date, the so-called Debian "unattended upgrades".

In this article, I'll explain how to install and enable automatic ("unattended") updates on Debian, with the hope that other Debian users might start benefiting from it.
 

Pros of enabling automatic updates:

  • The Debian OS stays secure without constant monitoring.
  • You save a lot of time by letting your system handle updates.
  • You presumably enjoy more peace of mind, knowing your system is better protected.

Cons of enabling automatic updates:

  • Some exotic and badly maintained packages might break after the update.
  • Customizations made to the OS, such as /etc/sysctl.conf or any other very custom server configs, might disappear or stop working after the update.
  • In the worst case (very rare but possible) the OS might fail to boot after an update 🙂

Regular security updates patch vulnerabilities that could otherwise be exploited by attackers, which is especially important for servers and systems exposed to the internet, where threats evolve constantly.

1. Update Debian System to latest

Before enabling automatic updates or making any other changes, run apt to update the package lists and upgrade any outdated packages, so that the configuration process goes smoothly.

# apt update && apt upgrade -y

2. Install the Unattended-Upgrades deb Package 

# apt install unattended-upgrades -y

Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
The following additional packages will be installed:
  distro-info-data gir1.2-glib-2.0 iso-codes libgirepository-1.0-1 lsb-release python-apt-common python3-apt python3-dbus python3-distro-info python3-gi
Suggested packages:
  isoquery python-apt-doc python-dbus-doc needrestart powermgmt-base
The following NEW packages will be installed:
  distro-info-data gir1.2-glib-2.0 iso-codes libgirepository-1.0-1 lsb-release python-apt-common python3-apt python3-dbus python3-distro-info python3-gi unattended-upgrades
0 upgraded, 11 newly installed, 0 to remove and 0 not upgraded.
Need to get 3,786 kB of archives.
After this operation, 24.4 MB of additional disk space will be used.
Do you want to continue? [Y/n]

 

 

# apt install apt-listchanges
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
The following package was automatically installed and is no longer required:
  linux-image-5.10.0-30-amd64
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  python3-debconf
The following NEW packages will be installed:
  apt-listchanges python3-debconf
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 137 kB of archives.
After this operation, 452 kB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://deb.debian.org/debian bookworm/main amd64 python3-debconf all 1.5.82 [3,980 B]
Get:2 http://deb.debian.org/debian bookworm/main amd64 apt-listchanges all 3.24 [133 kB]
Fetched 137 kB in 0s (292 kB/s)
Preconfiguring packages …
Deferring configuration of apt-listchanges until /usr/bin/python3
and python's debconf module are available
Selecting previously unselected package python3-debconf.
(Reading database … 84582 files and directories currently installed.)
Preparing to unpack …/python3-debconf_1.5.82_all.deb …
Unpacking python3-debconf (1.5.82) …
Selecting previously unselected package apt-listchanges.
Preparing to unpack …/apt-listchanges_3.24_all.deb …
Unpacking apt-listchanges (3.24) …
Setting up python3-debconf (1.5.82) …
Setting up apt-listchanges (3.24) …

Creating config file /etc/apt/listchanges.conf with new version

 

Example config for apt-listchanges would be like:

# vim /etc/apt/listchanges.conf
[apt]
frontend=pager
email_address=root
confirm=0
save_seen=/var/lib/apt/listchanges.db
which=both

3. Enable Automatic unattended upgrades

Once installed, enable automatic updates with the following command. It will prompt you, asking if you want to enable automatic updates. Select Yes and press Enter, which activates the unattended-upgrades service so it is ready to manage updates for you.

# dpkg-reconfigure unattended-upgrades

Configure-Unattended-Upgrades-on-Debian_Linux-dpkg-reconfigure-screenshot

Or non-interactively, by running the commands:

# echo unattended-upgrades unattended-upgrades/enable_auto_updates boolean true | debconf-set-selections
# dpkg-reconfigure -f noninteractive unattended-upgrades
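To double check that the setting took effect, you can read back the APT::Periodic values. This is only a minimal sketch, assuming the answer was written to /etc/apt/apt.conf.d/20auto-upgrades (the usual place on Debian); the helper name periodic_value is made up for illustration:

```shell
#!/bin/sh
# periodic_value KEY
# Reads apt configuration text on stdin and prints the value assigned to
# APT::Periodic::KEY. If the key is set more than once, the last one wins.
# (periodic_value is a hypothetical helper, not part of apt.)
periodic_value() {
    sed -n "s/^[[:space:]]*APT::Periodic::$1[[:space:]]*\"\([^\"]*\)\";.*/\1/p" | tail -n 1
}
```

Running periodic_value Unattended-Upgrade < /etc/apt/apt.conf.d/20auto-upgrades should print 1 when automatic upgrades are enabled. The output of apt-config dump can be piped into the same helper to see the value as apt itself resolves it.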


4. Set the Schedule for Automatic Updates on Debian

By default, unattended-upgrades runs daily. To verify or modify the schedule, check the systemd timers:

# systemctl status apt-daily.timer
# systemctl status apt-daily-upgrade.timer
# systemctl edit apt-daily-upgrade.timer

Current apt-daily.timer config as of Debian 12 (bookworm) is as follows

root@haproxy2:/etc/apt/apt.conf.d# cat  /lib/systemd/system/apt-daily.timer
[Unit]
Description=Daily apt download activities

[Timer]
OnCalendar=*-*-* 6,18:00
RandomizedDelaySec=12h
Persistent=true

[Install]
WantedBy=timers.target
root@haproxy2:/etc/apt/apt.conf.d#


 

# systemctl edit apt-daily-upgrade.timer

[Timer]
OnCalendar=
OnCalendar=03:00
RandomizedDelaySec=0

 

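Line 2 above (the empty OnCalendar=) is needed to reset the default OnCalendar value of the shipped timer unit (compare line 5 of the apt-daily.timer listing shown earlier). Without it, the new time would be added to the default schedule instead of replacing it.
Line 4 is needed to prevent any random delays coming from the defaults.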
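The reset behaviour described above can be checked without waiting for the timer to fire. Below is a small sketch that replays the "empty assignment clears the list" rule over unit file text; the function name effective_oncalendar is hypothetical and it only looks at OnCalendar lines:

```shell
#!/bin/sh
# effective_oncalendar
# Reads unit file text (shipped unit followed by drop-ins) on stdin and prints
# the OnCalendar values that remain in effect. An empty "OnCalendar=" clears
# everything collected before it, the same way systemd treats list settings.
effective_oncalendar() {
    awk '
        /^OnCalendar=/ {
            value = substr($0, index($0, "=") + 1)
            if (value == "") n = 0          # empty assignment resets the list
            else schedule[++n] = value
        }
        END { for (i = 1; i <= n; i++) print schedule[i] }
    '
}
```

With the override from this step in place, systemctl cat apt-daily-upgrade.timer | effective_oncalendar should print only 03:00. If the default time is printed as well, the empty OnCalendar= line is missing from the override. systemctl list-timers then shows when the timer will actually run next.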


Now both timers should be active. If not, activate them with:

# systemctl enable --now apt-daily.timer
# systemctl enable --now apt-daily-upgrade.timer


These timers ensure that updates are checked and applied regularly, without manual intervention.

5. Test once that Automatic Updates on Debian work

To ensure everything is working, simulate an unattended upgrade with a dry run:

# unattended-upgrade --dry-run

 

You can monitor automatic updates by checking the logs.

# less /var/log/unattended-upgrades/unattended-upgrades.log

The log shows details of installed updates and any issues that occurred. Reviewing it periodically helps you make sure that updates are being applied correctly and troubleshoot any problems.
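When the log gets long, it helps to look only at the lines that report trouble. The sketch below assumes the log lines have the shape "DATE TIME LEVEL message" (for example "2024-08-06 06:25:02,456 ERROR ..."); that format and the helper name uu_problems are assumptions, so adjust the pattern if your log looks different:

```shell
#!/bin/sh
# uu_problems
# Reads an unattended-upgrades log on stdin and prints only WARNING and ERROR
# lines. Assumes each line starts with "YYYY-MM-DD HH:MM:SS,mmm LEVEL ".
uu_problems() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,]+ (WARNING|ERROR) ' || true
}
```

For example: uu_problems < /var/log/unattended-upgrades/unattended-upgrades.log. No output means no warnings or errors were logged in that file.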

6. Advanced Configuration Options

If you’re a power user or managing multiple systems, you might want to explore these additional settings in the configuration file:

# vim /etc/apt/apt.conf.d/50unattended-upgrades


Configure unattended-upgrades to send you an email whenever updates are installed.

Unattended-Upgrade::Mail "your-email-address@email-address.com";


Enable automatic reboots when an update (for example a new kernel) requires one, by adding the line:

Unattended-Upgrade::Automatic-Reboot "true";

To schedule the reboot for a specific time after the package upgrade is applied:

Unattended-Upgrade::Automatic-Reboot-Time "02:00";

Specify packages you don’t want to be updated by editing the Unattended-Upgrade::Package-Blacklist section in the configuration file.
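The blacklist is a brace-enclosed list of quoted entries, each one matched against package names as a regular expression. As a rough sketch, the hypothetical helper below prints such a stanza for the package names you pass in (haproxy and zabbix-agent2 in the usage example are only placeholders):

```shell
#!/bin/sh
# blacklist_fragment PACKAGE...
# Prints an Unattended-Upgrade::Package-Blacklist stanza listing the given
# package names, ready to be pasted into 50unattended-upgrades.
blacklist_fragment() {
    printf 'Unattended-Upgrade::Package-Blacklist {\n'
    for pkg in "$@"; do
        printf '    "%s";\n' "$pkg"
    done
    printf '};\n'
}
```

blacklist_fragment haproxy zabbix-agent2 prints a stanza with two entries. Review the result before pasting it, because every entry is treated as a pattern and not as an exact name, so "haproxy" would also hold back packages whose names merely contain that string.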

 

Here is an alternative way to configure unattended upgrades, by using apt configuration options:

# vim /etc/apt/apt.conf.d/02periodic

// Control parameters for cron jobs by /etc/cron.daily/apt-compat //


// Enable the update/upgrade script (0=disable)
APT::Periodic::Enable "1";


// Do "apt-get update" automatically every n-days (0=disable)
APT::Periodic::Update-Package-Lists "1";


// Do "apt-get upgrade --download-only" every n-days (0=disable)
APT::Periodic::Download-Upgradeable-Packages "1";


// Run the "unattended-upgrade" security upgrade script
// every n-days (0=disabled)
// Requires the package "unattended-upgrades" and will write
// a log in /var/log/unattended-upgrades
APT::Periodic::Unattended-Upgrade "1";


// Do "apt-get autoclean" every n-days (0=disable)
APT::Periodic::AutocleanInterval "21";


// Send report mail to root
//     0:  no report             (or null string)
//     1:  progress report       (actually any string)
//     2:  + command outputs     (remove -qq, remove 2>/dev/null, add -d)
//     3:  + trace on
APT::Periodic::Verbose "2";

If you have to update multiple machines simultaneously over a limited or metered connection, configure download limits by setting options in /etc/apt/apt.conf.d/20auto-upgrades.
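One apt option that can serve as such a limit is Acquire::http::Dl-Limit, which takes an integer value in kilobytes per second. A minimal sketch that prints the line to add; the helper name and the value 500 used below are illustrative only:

```shell
#!/bin/sh
# dl_limit_fragment KILOBYTES_PER_SECOND
# Prints an apt configuration line that caps HTTP download speed.
# Rejects anything that is not a plain non-negative integer.
dl_limit_fragment() {
    case $1 in
        ''|*[!0-9]*)
            echo "usage: dl_limit_fragment KILOBYTES_PER_SECOND" >&2
            return 1
            ;;
    esac
    printf 'Acquire::http::Dl-Limit "%s";\n' "$1"
}
```

dl_limit_fragment 500 prints a line limiting downloads to roughly 500 KB/s. Appending that line to /etc/apt/apt.conf.d/20auto-upgrades applies it to every apt download over HTTP on that machine, not only to the unattended runs.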

7. Stop Automatic Unattended Upgrade

If unattended upgrades are no longer required and you want to revert to manual package updates, stop the unattended-upgrades service:

# systemctl stop unattended-upgrades


8. Stop an ongoing unattended apt package upgrade on a Debian server

Perhaps not often, but it might happen that an automated upgrade has broken a server or a service, and you would like to immediately stop the upcoming upgrades (some of which might have already started on other servers). The easiest way to do so (though not always a safe one) is to kill the unattended-upgrades daemon. The -f option makes pkill match against the full command line, since the kernel truncates process names to 15 characters and the longer name would never match.

# pkill --signal SIGKILL -f unattended-upgrade


Note that this is a very brutal way to kill it and it might leave some package updates broken, which you may later have to fix manually.

If the unattended-upgrade process is running in the background and you want to stop the in-progress upgrade in a way that is safer for the system, you can cancel it through the ncurses prompt interface of dpkg-reconfigure:

# dpkg-reconfigure unattended-upgrades


Then just select No, press Enter. In my case, this has promptly stopped the ongoing unattended upgrade that seemed blocked (at least as promptly as the hardware seemed to allow 🙂 ).

If you want to disable it for the future as well, so that it doesn't automatically get re-enabled by some update script on the next manual update, disable the service too.
 

# systemctl disable unattended-upgrades

 

Close up

That's all! Now your Debian system will automatically handle security updates, keeping your system secure without you having to do a thing.
The same guide should be good for most Deb based distributions such as Ubuntu / Mint and the rest of the Debian derivative OS-es.
You've now set up a reliable way to ensure your system stays protected from vulnerabilities, but it is still a good practice to log in regularly and check what the updates have done to the system, otherwise expect the unexpected.

How to Update / Migrate zabbix-agent 5 to zabbix-agent2 6 on Redhat / CentOS / Fedora Linux

Friday, August 9th, 2024

Upgrade-zabbix-agent1-5-to-zabbix-agent2-6-on-RHEL-CentOS-Fedora-Linux-howto-logo

If you have servers still reporting to Zabbix through Zabbix Agent 1 version 5.0.X, but you have already migrated the Zabbix server to Zabbix 6, it is a good idea to update the agent to Zabbix Agent 6 as soon as possible, since lagging behind in versions makes later updates a harder and more complicated task.

My experience, and I guess that of most system administrators, is that keeping many applications at the same version level has historically reduced unexpected errors and bugs, but nowadays the rule of keeping local and remote applications (programs) at the same version level is regularly broken.

Theoretically the Zabbix agent (client) and the Zabbix server are compatible across a certain range of versions (Zabbix agent 2 from version 4.4 onwards is compatible with Zabbix 7.0; Zabbix agent 2 must not be newer than 7.0; for more, check the agent -> server version compatibility notes in the Zabbix documentation), and a slight version difference should not really be a problem. Often, however, you have third party proxies in between, such as haproxy or zabbix-proxy, or other network oddities, so my personal opinion is that for interoperability it is better to keep the Zabbix clients and Zabbix servers across the DMZ-ed networks running at the same version level.

Some would say I have an old fashioned way of thinking, as software and technology are moving forward, but as I see how code writing and even software as a whole is constantly degrading, which is just a reflection of the degradation of the human element, I prefer to keep my old know-how and always stick to the same versioning whenever possible.

Some would then wonder why I upgrade to zabbix-agent2 at all if I want to keep the same versioning. The reason is that zabbix-agent2 is written in Go and is a much faster and supposedly better piece of software than Zabbix Agent 1, which is written in C.

Moreover, having Zabbix agent 2 instead of 1 also brings benefits, as you can do a bit more with Zabbix, and the machines are better prepared for future monitoring needs. To know more about the benefits of Zabbix Agent 2 compared to Zabbix Agent 1, read the Agent vs Agent2 comparison on the Zabbix website.

 

With this little introduction, let's proceed with the exact steps to take to upgrade zabbix-agent1 to zabbix-agent2.

1. Check the current installed Zabbix-Agent version 

[user@monitored-server ~]$ rpm -qa |grep -i zabb
zabbix-get-5.0.42-1.el8.x86_64
zabbix-sender-5.0.42-1.el8.x86_64
zabbix-agent-5.0.42-1.el8.x86_64

[user@server ~]$ 

 

2. Create backup copy of current system working zabbix_agentd.conf
 

Before messing with the working zabbix-agent, as usual create the necessary backup to prevent later surprises:

[user@monitored-server ~]$ cp -vrpf /etc/zabbix/zabbix_agentd.conf /etc/zabbix/zabbix_agentd.conf.bak-$(date '+%Y-%m-%d_%H-%M-%S')

3. Check current configured Zabbix repos

 

[user@monitored-server ~]$ vim /etc/yum.repos.d/zabbix.repo
 

[zabbix-4.0]
name = zabbix-4.0 – 8
baseurl = http://zabbix-repo-server.com/external/zabbix-4.0/8/$basearch
enabled = 0
gpgkey = http://zabbix-repo-server.com/external/zabbix-4.0/zabbix-official-repo.key
gpgcheck = 1

[zabbix-4.4]
name = zabbix-4.4 – 8
baseurl = http://zabbix-repo-server.com/external/zabbix-4.4/8/$basearch
enabled = 0
gpgkey = http://zabbix-repo-server.com/external/zabbix-4.4/zabbix-official-repo.key
gpgcheck = 1

[zabbix-5.0]
name = zabbix-5.0 – 8
baseurl = http://zabbix-repo-server.com/external/zabbix-5.0/8/$basearch
enabled = 1
gpgkey = http://zabbix-repo-server.com/external/zabbix-5.0/zabbix-official-repo.key
gpgcheck = 1

[zabbix-5.4]
name = zabbix-5.4 – 8
baseurl = http://zabbix-repo-server.com/external/zabbix-5.4/8/$basearch
enabled = 0
gpgkey = http://zabbix-repo-server.com/external/zabbix-5.4/zabbix-official-repo.key
gpgcheck = 1

[zabbix-6.0]
name = zabbix-6.0 – 8
baseurl = http://zabbix-repo-server.com/external/zabbix-6.0/8/$basearch
enabled = 0
gpgkey = http://zabbix-repo-server.com/external/zabbix-6.0/zabbix-official-repo.key
gpgcheck = 1


4. Modify repositories and include the Zabbix Agent6 yum repos 
 

[user@monitored-server ~]$ cp -rpf /etc/yum.repos.d/zabbix.repo /etc/yum.repos.d/zabbix.repo.5.0.rpmsave

As we want to keep only the 6.0 version, leave only the zabbix-6.0 section and enable the repo:
 

[user@monitored-server ~]$ vim /etc/yum.repos.d/zabbix.repo

[zabbix-6.0]
name = zabbix-6.0 – 8
baseurl = http://zabbix-repo-server.com/external/zabbix-6.0/8/$basearch
enabled = 1
gpgkey = http://zabbix-repo-server.com/external/zabbix-6.0/zabbix-official-repo.key
gpgcheck = 1


5. Update zabbix-agent to zabbix-agent2 and update zabbix-get zabbix-sender versions

To not disrupt the monitoring reported by zabbix-agent, don't delete zabbix-agent1. Instead, install and configure zabbix-agent2 in parallel,
and once the configuration is migrated from Agent 1 to 2, stop the old zabbix-agent and bring up the new one.

[user@monitored-server ~]$ yum check-update

[user@monitored-server ~]$ yum install zabbix-agent2 zabbix-get zabbix-sender -y

Note that if you want a precise version of zabbix-agent2, let's say 6.0.31 to correspond to zabbix-server 6.0.31 (even though newer RPM versions are available in the repositories), run:
 

[user@monitored-server ~]$ yum upgrade zabbix-agent2-6.0.31-release1.el8

 

  • Check new zabbix_agent2 installed version 


# zabbix_agent2 -V
zabbix_agent2 (Zabbix) 6.0.31
Revision b6d93755a1b 17 June 2024, compilation time: {undefined} {undefined}, built with: go1.21.3
Plugin communication protocol version is 6.0.13

Copyright (C) 2024 Zabbix SIA
License GPLv2+: GNU GPL version 2 or later <https://www.gnu.org/licenses/>.
This is free software: you are free to change and redistribute it according to
the license. There is NO WARRANTY, to the extent permitted by law.

This product includes software developed by the OpenSSL Project
for use in the OpenSSL Toolkit (http://www.openssl.org/).

Compiled with OpenSSL 1.1.1k  FIPS 25 Mar 2021
Running with OpenSSL 1.1.1k  FIPS 25 Mar 2021

We use the library Eclipse Paho (eclipse/paho.mqtt.golang), which is
distributed under the terms of the Eclipse Distribution License 1.0 (The 3-Clause BSD License)
available at https://www.eclipse.org/org/documents/edl-v10.php

We use the library go-modbus (goburrow/modbus), which is
distributed under the terms of the 3-Clause BSD License
available at https://github.com/goburrow/modbus/blob/master/LICENSE

 

6. Migrate old /etc/zabbix/zabbix_agentd.conf to /etc/zabbix/zabbix_agent2.conf

For readability, show only the main configured variables of zabbix-agent, without the tons of comments, to later include them in agent2:
 

[root@monitored-server ~]# cat /etc/zabbix/zabbix_agentd.conf | grep -v '\#' | sed '/^$/d' 
PidFile=/var/run/zabbix/zabbix_agentd.pid
LogFile=/var/log/zabbix/zabbix_agentd.log
LogFileSize=0
Server=10.50.37.8,127.0.0.1
ServerActive=10.50.37.8,127.0.0.1
Hostname=fqdn-of-monitored-host.domain.com
Timeout=20
Include=/etc/zabbix/zabbix_agentd.d/*.conf

The default installed zabbix-agent2 config would look similar to:

[root@monitored-server ~]# cat /etc/zabbix/zabbix_agent2.conf | grep -v '\#' | sed '/^$/d'
PidFile=/run/zabbix/zabbix_agent2.pid
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=0
Server=127.0.0.1
# Specify the location of the Zabbix server host.
ServerActive=127.0.0.1
Hostname=Zabbix server
Include=/etc/zabbix/zabbix_agent2.d/*.conf
PluginSocket=/run/zabbix/agent.plugin.sock
ControlSocket=/run/zabbix/agent.sock
Include=./zabbix_agent2.d/plugins.d/*.conf

The new migrated one should look like:

[root@monitored-server ~]# vim /etc/zabbix/zabbix_agent2.conf
PidFile=/run/zabbix/zabbix_agent2.pid
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=10
Server=10.34.89.7,127.0.0.1
ServerActive=10.34.89.7,127.0.0.1
Hostname=lqgblu02f.ffm.de.int.atosorigin.com
Timeout=20
Include=/etc/zabbix/zabbix_agent2.d/*.conf
PluginSocket=/run/zabbix/agent.plugin.sock
ControlSocket=/run/zabbix/agent.sock
Include=/etc/zabbix/zabbix_agent2.d/plugins.d/*.conf
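Copying the values over by hand works fine for one host. For several hosts, the site specific settings can be pulled out of the old file with a few lines of shell. This is only a sketch: carry_over is a hypothetical helper, and which keys are worth carrying over (here Server, ServerActive, Hostname and Timeout) is my assumption based on the configs shown above.

```shell
#!/bin/sh
# carry_over OLD_CONF KEY...
# Prints the "KEY=value" lines for the given keys from an old agent config,
# in the order the keys are listed. Comment lines never match, because the
# text before the first "=" has to equal the key exactly.
carry_over() {
    old_conf=$1
    shift
    for key in "$@"; do
        awk -F= -v k="$key" '$1 == k { print }' "$old_conf"
    done
}
```

For example, carry_over /etc/zabbix/zabbix_agentd.conf Server ServerActive Hostname Timeout prints the four lines to paste into zabbix_agent2.conf. Paths such as PidFile, LogFile and Include are deliberately left out, since agent2 uses its own file names for them.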


7. Add a few Optimization variables for better zabbix-agent -> zabbix-proxy -> zabbix-server interactions

If you sometimes have network delays between zabbix server -> zabbix client and vice versa (depending on whether the Zabbix agent is configured in Active or Passive mode), it is often useful
to add these 2 variables:

# How often list of active checks is refreshed, in seconds
RefreshActiveChecks=60
# Refresh the active checks on start.
ForceActiveChecksOnStart=1


It might also be a good practice to let the agent itself rotate zabbix_agent2.log once the log exceeds a certain size, instead of doing it via logrotate.
 

# Perform log file rotation at the 1 MB point for the specified filepath
LogFileSize=1

 

[root@monitored-server ~]# vim /etc/zabbix/zabbix_agent2.conf
PidFile=/run/zabbix/zabbix_agent2.pid
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=10
Server=10.34.89.7,127.0.0.1
ServerActive=10.34.89.7,127.0.0.1
Hostname=lqgblu02f.ffm.de.int.atosorigin.com
RefreshActiveChecks=60
ForceActiveChecksOnStart=1
Timeout=20
Include=/etc/zabbix/zabbix_agent2.d/*.conf
PluginSocket=/run/zabbix/agent.plugin.sock
ControlSocket=/run/zabbix/agent.sock
Include=/etc/zabbix/zabbix_agent2.d/plugins.d/*.conf

 

8. Stop the old zabbix agent process and run the new one

# systemctl status --full zabbix-agent2
# systemctl stop zabbix-agent


Assuming that the configuration of zabbix-agent2 is correct, start zabbix-agent2 via systemctl and check its status:
 

# systemctl start zabbix-agent2
# systemctl status --full zabbix-agent2


If there are no errors in the configuration, the zabbix_agent2 process should be up and running and the systemctl status command above should report fine.
If you need specifics regarding particular Zabbix checks, errors in the currently configured UserParameter scripts, or any other warnings or errors
of zabbix_agent2 interacting with the server, check the logs further:

[root@monitored-server ~]# tail -n 10 /var/log/zabbix/zabbix_agent2.log  
2024/08/06 17:26:52.998749 using plugin 'WebPage' (built-in) providing following interfaces: exporter, configurator
2024/08/06 17:26:52.998760 using plugin 'ZabbixAsync' (built-in) providing following interfaces: exporter
2024/08/06 17:26:52.998794 using plugin 'ZabbixStats' (built-in) providing following interfaces: exporter, configurator
2024/08/06 17:26:52.998804 lowering the plugin ZabbixSync capacity to 1 as the configured capacity 100 exceeds limits
2024/08/06 17:26:52.998820 using plugin 'ZabbixSync' (built-in) providing following interfaces: exporter
2024/08/06 17:26:52.998993 Plugin communication protocol version is 6.0.13
2024/08/06 17:26:52.999018 Zabbix Agent2 hostname: [lqgblu02f.ffm.de.int.atosorigin.com]
2024/08/06 17:26:54.000667 [102] cannot connect to [127.0.0.1:10051]: dial tcp :0->127.0.0.1:10051: connect: connection refused
2024/08/06 17:26:54.000836 [102] active check configuration update from host [lqgblu02f.ffm.de.int.atosorigin.com] started to fail
2024/08/06 17:26:59.344837 Zabbix Agent 2 stopped. (6.0.31)
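In the excerpt above the agent complains that it cannot connect to 127.0.0.1:10051, most likely because 127.0.0.1 is listed in ServerActive while no Zabbix server or proxy listens locally. To spot such endpoints quickly in a longer log, a small filter like the one below can help. It is a sketch that relies only on the "cannot connect to [host:port]" wording seen in this log; the helper name failed_endpoints is hypothetical.

```shell
#!/bin/sh
# failed_endpoints
# Reads a zabbix_agent2 log on stdin and prints each distinct host:port
# that appears in a "cannot connect to [...]" message.
failed_endpoints() {
    sed -n 's/.*cannot connect to \[\([^]]*\)\].*/\1/p' | sort -u
}
```

Run it as failed_endpoints < /var/log/zabbix/zabbix_agent2.log. Every address it prints is either unreachable from this host or should be removed from ServerActive.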

Haproxy Enable / Disable Application backend server configured to roundrobin in emergency case via haproxy socket command

Thursday, May 2nd, 2024

haproxy-stats-socket

The members of the Haproxy LB backend BACKEND_ROUNDROBIN are configured in roundrobin, with a health check on a dedicated check port (check port 33333).
For example, let's say the haproxy server is running with a haproxy_roundrobin.cfg like this one.

Under some circumstances, however, check port TCP 33333 is UP, but the Application that provides the resources to customers misbehaves on 1 or more of the members
(app-server1, app-server2, app-server3, app-server4). The Load Balancer cannot know this, because the traffic routing decision is made based on the Echo port.

One example scenario when this can happen is if Application server has issue with connectivity towards Database hosts:
(db-host1, db-host2, db-host3, db-host4)

If this happens, 25% of the traffic might still get balanced to the broken Application server. If such a scenario happens during OnCall and this is identified as the problem,
the workaround would be to temporarily disable the misbehaving App server member from the 4 configured roundrobin members in haproxyproduction.cfg:

For example, if the app-server3 App node is identified as failing and 25% of the traffic via the LB is lost, then until the broken Application server node is fixed you will have to temporarily exclude it from the ring of roundrobin backend hosts.

1.  Check the status of haproxy backends

echo "show stat" | socat stdio /var/lib/haproxy/stats

The status column of the output shows whether each backend server is UP or disabled (MAINT).

Another way to do it, which will not cut your sessions to the server immediately but keep them for some time, is to put the server you want to exclude from the haproxy roundrobin into "maintenance mode".

echo "set server bk_BACKEND_ROUNDROBIN/app-server3 state maint" | socat unix-connect:/var/lib/haproxy/stats stdio

Actually, there is an even better and more advanced way to take a server out of a configured roundrobin set of hosts: the "drain" state. In drain mode the server is removed from load balancing, but it is still health checked, and sessions that are already bound to it (persistent connections) can continue to reach it until the App server is ready to go out of maintenance.

echo "set server bk_BACKEND_ROUNDROBIN/app-server3 state drain" | socat unix-connect:/var/lib/haproxy/stats stdio

 

  • This sets the server in DRAIN mode. It no longer receives new load-balanced connections, and existing connections are drained.

To get a better idea on what is drain state, here is excerpt from haproxy official documentation:

Force a server's administrative state to a new state. This can be useful to
disable load balancing and/or any traffic to a server. Setting the state to
"ready" puts the server in normal mode, and the command is the equivalent of
the "enable server" command. Setting the state to "maint" disables any traffic
to the server as well as any health checks. This is the equivalent of the
"disable server" command. Setting the mode to "drain" only removes the server
from load balancing but still allows it to be checked and to accept new
persistent connections. Changes are propagated to tracking servers if any.
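Since the three states differ only in the last word of the command, a tiny wrapper saves typing the wrong one during an incident. The following is a sketch; haproxy_state_cmd is a made-up helper that only builds the command text and refuses anything other than the states named in the excerpt above:

```shell
#!/bin/sh
# haproxy_state_cmd BACKEND SERVER STATE
# Prints a "set server" command for the haproxy admin socket.
# STATE must be one of: ready, maint, drain.
haproxy_state_cmd() {
    case $3 in
        ready|maint|drain)
            printf 'set server %s/%s state %s\n' "$1" "$2" "$3"
            ;;
        *)
            echo "unknown state: $3 (expected ready, maint or drain)" >&2
            return 1
            ;;
    esac
}
```

The output is meant to be piped to the socket the same way as in the commands above, for example haproxy_state_cmd bk_BACKEND_ROUNDROBIN app-server3 drain | socat unix-connect:/var/lib/haproxy/stats stdio, and later with ready to put the server back.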


2. Disable backend app-server3 from roundrobin


 

echo "disable server bk_BACKEND_ROUNDROBIN/app-server3" | socat unix-connect:/var/lib/haproxy/stats stdio

echo "show stat" | socat stdio /var/lib/haproxy/stats
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
stats,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,282917,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,
Frontend_Name,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
Backend_Name,app-server4,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,0,282917,0,,1,4,1,,0,,2,0,,0,L4OK,,12,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,app-server3,0,0,0,0,,0,0,0,,0,,0,0,0,0,MAINT,1,0,1,1,2,2,23,,1,4,2,,0,,2,0,,0,L4OK,,11,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,1,1,0,,0,282917,0,,1,4,0,,0,,1,0,,0,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,

Once the Application support colleagues confirm that the machine is out of maintenance mode and working properly again, reenable it:

3. Enable backend app-server3

echo "enable server bk_BACKEND_ROUNDROBIN/app-server3" | socat unix-connect:/var/lib/haproxy/stats stdio

4. Check backend situation again

echo "show stat" | socat stdio /var/lib/haproxy/stats
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
stats,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,282955,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,
Frontend_Name,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
Backend_Name,app-server4,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,0,282955,0,,1,4,1,,0,,2,0,,0,L4OK,,12,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,app-server3,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,1,2,3,58,,1,4,2,,0,,2,0,,0,L4OK,,11,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,1,1,1,,0,282955,0,,1,4,0,,0,,1,0,,0,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,


You should see the backend enabled again.
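The raw CSV is hard to read by eye. Going by the header line in the output above, the status is the 18th comma separated field, so a short awk filter can reduce the output to one line per entry. A minimal sketch, with server_status being a hypothetical helper name:

```shell
#!/bin/sh
# server_status
# Reads "show stat" CSV output on stdin and prints "proxy/server STATUS"
# for every data row. The header row (starting with "#") is skipped.
server_status() {
    awk -F, '!/^#/ && NF > 17 { print $1 "/" $2, $18 }'
}
```

Used as echo "show stat" | socat stdio /var/lib/haproxy/stats | server_status, a disabled server shows up with MAINT next to its name and a re-enabled one with UP.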

NOTE:
If you happen to get "permission denied" errors when you try to send haproxy commands via the configured haproxy stats socket, this might be because the socket is enabled in read only mode. In that case you can only read info from haproxy with status commands, but cannot send any write operations to it via the unix socket.

One example haproxy configuration that enables haproxy socket in read only looks like this in haproxy.cfg:
 

 stats socket /var/lib/haproxy/stats


To put the haproxy socket in read / write mode for the root superuser and for other users belonging to the admin group 'adm', set haproxy.cfg to something like:

stats socket /var/lib/haproxy/stats-qa mode 0660 group adm level admin

or, if no special users in a dedicated admin group need access to the socket, use a config like this instead:

stats socket /var/lib/haproxy/stats-qa.sock mode 0600 level admin

Configure aide file integrity check server monitoring in Zabbix to track for file changes on servers

Tuesday, March 28th, 2023

zabbix-aide-log-monitoring-logo

Earlier I've written a small article on how to setup AIDE monitoring for Server File integrity check on Linux, which covered the basics of how this handy software for improving your overall server security can be installed and set up without much hassle.

Once AIDE is set up and a custom configuration is prepared for it, it is pretty useful to have Zabbix watch the AIDE log output for new records, so that changes to critical files are noticed quickly. Usually, if no files monitored by AIDE are modified on the machine, the log will not grow. But if some file is modified, then once the Advanced Intrusion Detection Environment (aide) binary runs via the scheduled Cron job, the /var/log/aide/app_aide.log file will grow; zabbix-agentd continuously checks the file for new records and will react.

Before setting up the required Zabbix Template, you will have to set up a few small scripts and a preconfigured list of binaries, application files etc. that aide will monitor, let's say via /etc/aide-custom.conf
 

1. Configure aide to monitor files for changes


Before running aide, it is a good idea to prepare a file with the custom defined directories and files that you plan to monitor for integrity, i.e. for future changes. The goal is, for example, to catch intruders who break into a server running aide and modify critical files such as /etc/passwd, /etc/shadow, /etc/group, or files under /usr/local/etc/*, /var/* or /usr/*: critical files that shouldn't be allowed to change without the admin soon finding out.

# cat /etc/aide-custom.conf

# Example configuration file for AIDE.
@@define DBDIR /var/lib/aide
@@define LOGDIR /var/log/aide
# The location of the database to be read.
database=file:@@{DBDIR}/app_aide.db.gz
database_out=file:@@{DBDIR}/app_aide.db.new.gz
gzip_dbout=yes
verbose=5

report_url=file:@@{LOGDIR}/app_aide.log
#report_url=syslog:LOG_LOCAL2
#report_url=stderr
#NOT IMPLEMENTED report_url=mailto:root@foo.com
#NOT IMPLEMENTED report_url=syslog:LOG_AUTH

# These are the default rules.
#
#p:      permissions
#i:      inode:
#n:      number of links
#u:      user
#g:      group
#s:      size
#b:      block count
#m:      mtime
#a:      atime
#c:      ctime
#S:      check for growing size
#acl:           Access Control Lists
#selinux        SELinux security context
#xattrs:        Extended file attributes
#md5:    md5 checksum
#sha1:   sha1 checksum
#sha256:        sha256 checksum
#sha512:        sha512 checksum
#rmd160: rmd160 checksum
#tiger:  tiger checksum

#haval:  haval checksum (MHASH only)
#gost:   gost checksum (MHASH only)
#crc32:  crc32 checksum (MHASH only)
#whirlpool:     whirlpool checksum (MHASH only)

FIPSR = p+i+n+u+g+s+m+c+acl+selinux+xattrs+sha256

#R:             p+i+n+u+g+s+m+c+acl+selinux+xattrs+md5
#L:             p+i+n+u+g+acl+selinux+xattrs
#E:             Empty group
#>:             Growing logfile p+u+g+i+n+S+acl+selinux+xattrs

# You can create custom rules like this.
# With MHASH…
# ALLXTRAHASHES = sha1+rmd160+sha256+sha512+whirlpool+tiger+haval+gost+crc32
ALLXTRAHASHES = sha1+rmd160+sha256+sha512+tiger
# Everything but access time (Ie. all changes)
EVERYTHING = R+ALLXTRAHASHES

# Sane, with multiple hashes
# NORMAL = R+rmd160+sha256+whirlpool
NORMAL = FIPSR+sha512

# For directories, don't bother doing hashes
DIR = p+i+n+u+g+acl+selinux+xattrs

# Access control only
PERMS = p+i+u+g+acl+selinux

# Logfile are special, in that they often change
LOG = >

# Just do sha256 and sha512 hashes
LSPP = FIPSR+sha512

# Some files get updated automatically, so the inode/ctime/mtime change
# but we want to know when the data inside them changes
DATAONLY =  p+n+u+g+s+acl+selinux+xattrs+sha256

##############TOUPDATE
#To delegate to app team create a file like /app/aide.conf
#and uncomment the following line
#@@include /app/aide.conf
#Then remove all the following lines
/etc/zabbix/scripts/check.sh FIPSR
/etc/zabbix/zabbix_agentd.conf FIPSR
/etc/sudoers FIPSR
/etc/hosts FIPSR
/etc/keepalived/keepalived.conf FIPSR
# monitor haproxy.cfg
/etc/haproxy/haproxy.cfg FIPSR
# monitor keepalived
/home/keepalived/.ssh/id_rsa FIPSR
/home/keepalived/.ssh/id_rsa.pub FIPSR
/home/keepalived/.ssh/authorized_keys FIPSR

/usr/local/bin/script_to_run.sh FIPSR
/usr/local/bin/another_script_to_monitor_for_changes FIPSR

#  cat /usr/local/bin/aide-config-check.sh
#!/bin/bash
/sbin/aide -c /etc/aide-custom.conf -D

# cat /usr/local/bin/aide-init.sh
#!/bin/bash
/sbin/aide -c /etc/custom-aide.conf -B database_out=file:/var/lib/aide/custom-aide.db.gz -i

 

# cat /usr/local/bin/aide-check.sh

#!/bin/bash
/sbin/aide -c /etc/custom-aide.conf -Breport_url=stdout -B database=file:/var/lib/aide/custom-aide.db.gz -C|/bin/tee -a /var/log/aide/custom-aide-check.log|/bin/logger -t custom-aide-check-report
/usr/local/bin/aide-init.sh

 

# cat /usr/local/bin/aide_app_cron_daily.txt

#!/bin/bash
# If first time, we need to init the DB
if [ ! -f /var/lib/aide/app_aide.db.gz ]
   then
    logger -p local2.info -t app-aide-check-report  "Generating NEW AIDE DATABASE for APPLICATION"
    nice -n 18 /sbin/aide --init -c /etc/aide-custom.conf
    mv /var/lib/aide/app_aide.db.new.gz /var/lib/aide/app_aide.db.gz
fi

nice -n 18 /sbin/aide --update -c /etc/aide-custom.conf
# since the option for syslog seems not fully implemented we need to push logs via logger
/bin/logger -f /var/log/aide/app_aide.log -p local2.info -t app-aide-check-report
# Acknowledge the new database as the primary (all results are sent to syslog anyway)
mv /var/lib/aide/app_aide.db.new.gz /var/lib/aide/app_aide.db.gz

What the above cron job does is pretty simple, as you can read for yourself. If the aide database file /var/lib/aide/app_aide.db.gz predefined in the configuration does not exist,
aide will create a fresh database holding the checksums of all predefined files, to be stored as the comparison baseline for file changes.

Next, the update run compares the files against that baseline, and there is a line that writes the reported file changes to rsyslog through logger with the local2.info facility and priority.
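The cron job above ignores the exit status of aide, although it carries useful information. According to the aide manual page, as far as I know, the status of a check or update run is a bitmask (1 = new files, 2 = removed files, 4 = changed files), while values from 14 upwards signal errors. A small sketch that decodes it; aide_status is a hypothetical helper and not part of aide:

```shell
#!/bin/sh
# aide_status EXIT_CODE
# Translates the exit status of "aide --check" / "aide --update" into words.
# Bit 1 = new files, bit 2 = removed files, bit 4 = changed files;
# 14 and above are treated as errors (write error, bad config, ...).
aide_status() {
    code=$1
    if [ "$code" -ge 14 ]; then
        echo "error"
        return 0
    fi
    result=""
    [ $((code & 1)) -ne 0 ] && result="$result new"
    [ $((code & 2)) -ne 0 ] && result="$result removed"
    [ $((code & 4)) -ne 0 ] && result="$result changed"
    result=${result# }               # drop the leading space, if any
    echo "${result:-unchanged}"
}
```

Inside the cron script it could be called right after the update run, for example aide_status $? piped into logger with the same tag, so that syslog also receives a one word summary of every run.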


2. Setup Zabbix Template with Items, Triggers and set Action

2.1 Create new Template and name it YourAppName APP-LB File integrity Check

aide-itengrity-check-zabbix_ Configuration of templates

Then set up the required Items, which use the skip mode of Zabbix's log[] item to scan the file at a predefined interval of time; this is done by the zabbix-agent that is
supposed to run on the server.

2.2 Configure Item like

aide-zabbix-triggers-screenshot
 

*Name: check aide log file

Type: zabbix (active)

log[/var/log/aide/app_aide.log,^File.*,,,skip]

Type of information: Log

Update Interval: 30s

Applications: File Integrity Check

Configure Trigger like

Enabled: Tick On

images/aide-zabbix-screenshots/check-aide-log-item
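
Before relying on the item, it can help to preview which lines of the log the ^File.* regexp from the item key would actually catch. Below is a minimal sketch, assuming grep is available on the monitored host; the function name preview_aide_matches is made up for illustration and is not part of the original setup:

```shell
# preview_aide_matches LOGFILE
# Print the lines of LOGFILE that match the same regexp as the one used
# in the Zabbix item key (^File.*). Hypothetical helper, for a manual check only.
preview_aide_matches() {
    grep -E '^File.*' "$1"
}

# Example (log path taken from the item key above):
# preview_aide_matches /var/log/aide/app_aide.log
```

If the function prints nothing while aide did report changes, the regexp in the item key most likely needs adjusting to the report format of your aide version.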


2.3 Create Triggers with the respective regular expressions, that would check the aide generated log file for file modifications


aide-zabbix-screenshot-minor-config

Configure Trigger like
 

Enabled: Tick On


*Name: Someone modified {{ITEM.VALUE}.regsub("(.*)", \1)}

*Expression: {PROD APP-LB File Integrity Check:log[/var/log/aide/app_aide.log,^File.*,,,skip].strlen()}>=1

Allow manual close: yes tick

*Description: Someone modified {{ITEM.VALUE}.regsub("(.*)", \1)} on {HOST.NAME}

 

2.4 Configure Action

 

aide-zabbix-file-monitoring-action-screensho

Now, assuming the Zabbix server has a properly configured media type for communication and you have set alerting rules, zabbix-server can easily be set to send mails to a Support email address, so you get notification alerts every time a file monitored by aide gets changed.

That's all folks! Enjoy being notified on every file change on your servers!
 

Linux extending life time for a damaged hard drive server tricks on a live server. Force fsck on next reboot. Read-only file system error solutions

Friday, February 17th, 2023

linux-extending-life-time-for-a-damaged-hard-drive-server-tricks-can-not-read-superblock-linux-force-fsck-on-next-reboot

In our daily work as system administrators we have some very old Legacy systems running Clustered High Availability proxies using CRM (Cluster Resource Manager) and some legacy systems still using Heartbeat to manage the cluster instead of the newer and modern Corosync variant.

The HA cluster consists of only 2 Linux nodes running the obscure, long unsupported Redhat 5.11 (Tikanga), a release line that became stable back in the distant year 2007 (yeah, the years were good) and whose EOL (End of Life) was reached long ago, so the OS is no longer supported. For about 14 years, however, the machines had been running perfectly fine, until trouble hit one of the cluster nodes managed by ocf::heartbeat:IPAddr2, that is the /etc/ha.d/resource.d/IPAddr2 shell script. Yeah, for the newbies: a Heartbeat Application Cluster in Linux works like that, it uses a number of extendable shell scripts written for HA management of different kinds of Network / Web / Mail / SQL or whatever services.

The first configured node, however, started failing with errors like:
 

EXT3-fs error (device dm-1): ext3_journal_start_sb: Detected aborted journal
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda1.
sd 0:2:0:0: rejecting I/O to offline device
printk: 159 messages suppressed.
Buffer I/O error on device sda1, logical block 526
lost page write due to I/O error on sda1
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
megaraid_sas: FW was restarted successfully, initiating next stage…
megaraid_sas: HBA recovery state machine, state 2 starting…
megasas: Waiting for FW to come to ready state
megasas: FW in FAULT state!!
FW state [-268435456] hasn't changed in 180 secs
megaraid_sas: out: controller is not in ready state
megasas: waiting_for_outstanding: after issue OCR. 
megasas: waiting_for_outstanding: before issue OCR. FW state = f0000000
megaraid_sas: pending commands remain even after reset handling. megasas[0]: Dumping Frame Phys Address of all pending cmds in FW
megasas[0]: Total OS Pending cmds : 0 megasas[0]: 64 bit SGLs were sent to FW
megasas[0]: Pending OS cmds in FW :

The result was that the filesystem of the machine frequently got re-mounted as Read Only, and of course that is
quite bad if you have running haproxy processes that should be living there and taking up some Web traffic
for high availability, while all the traffic runs only on the 2nd machine of the pair.

This of course was a clear sign of failing disks, of some bad block regions being hit or, as the messages indicate, of some
problem with the system hardware or the RAID SAS array.

The physical raid on the system, just like the rest of the hardware, is very old stuff as well.

[root@haproxy_lb_node1 ~]# lspci |grep -i RAI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)

The produced errors not only made the machine auto-mount its root / filesystem in Read-Only mode, but most
likely also made the machine reboot automatically every few days, or a few times a day in a row.

The second Load Balancer, node2, operated perfectly, and we thought that we might just keep the broken machine in that half running
and inconsistent state for a few weeks, until we had built the new machines with a pre-installed new haproxy cluster on a modern
RedHat Linux 8.6 distribution. But since we have to follow SLAs (Service Level Agreements) with Customers, and the end services behind the
High Availability (HA) Haproxy cluster were in danger …

We as sysadmins had the task to do our best to stabilize the unstable node with disk errors, so that the system would survive
and be able to serve traffic normally (if node2, which is in a separate Data center, fails due to hardware or electricity issues etc.).

Here are a few steps we took, which have hopefully improved the situation.

1. Make backups of the most important files

Always, before doing anything with a broken system, prepare a backup of the most important files. If that is a cluster, that should be a backup of the cluster configurations (if you don't have one already), a backup of /etc/hosts, a backup of any important service configs such as /etc/haproxy/haproxy.cfg and /etc/postfix/postfix.cfg (as it was in my case), preferably a backup of the whole /etc/, of any important files from the /root/ or /home/users* directories, and of at least the latest logs from /var/log etc.
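
Such a backup can be sketched as a small tar wrapper. This is only an illustration under assumptions: the function name backup_paths and the destination /mnt/backup are made up, and the archive should of course land on storage that does not sit on the failing disk (an NFS share, a USB drive or the other node).

```shell
# backup_paths ARCHIVE PATH...
# Pack the given files / directories into a gzip-compressed tar archive.
# Paths that do not exist are skipped with a warning instead of aborting
# the whole backup. Hypothetical helper name.
backup_paths() {
    archive=$1
    shift
    # Rebuild the argument list so that it holds only existing paths
    for p in "$@"; do
        shift
        if [ -e "$p" ]; then
            set -- "$@" "$p"
        else
            echo "skipping missing path: $p" >&2
        fi
    done
    [ "$#" -gt 0 ] || { echo "nothing to back up" >&2; return 1; }
    tar -czf "$archive" "$@"
}

# Example (destination path is an assumption, adjust to your environment):
# backup_paths /mnt/backup/node1-configs.tar.gz /etc /root /var/log
```

Keeping the list of paths on the command line makes it easy to start with the small critical configs first and only then try the bigger directories, in case reading them triggers another I/O error.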
 

2. Clear up all unnecessary services and scripts from the server

Any additional Software / Services, integrity checking tools (daemons), scripts and cron jobs were immediately stopped and, where unused, removed.

E.g. we went through /etc/cron* to check what's there,

# ls -ld /etc/cron.*
drwx------ 2 root root 4096 Feb  7 18:13 /etc/cron.d
drwxr-xr-x 2 root root 4096 Feb  7 17:59 /etc/cron.daily
-rw-r--r-- 1 root root    0 Jul 20  2010 /etc/cron.deny
drwxr-xr-x 2 root root 4096 Jan  9  2013 /etc/cron.hourly
drwxr-xr-x 2 root root 4096 Jan  9  2013 /etc/cron.monthly
drwxr-xr-x 2 root root 4096 Aug 26  2015 /etc/cron.weekly

 

And like real professional butchers, we removed everything unnecessary that could trigger any extra disk reads / writes to the HDD.

E.g. just create

# mkdir -p /root/etc_old/etc/{cron.d,cron.daily,cron.hourly,cron.monthly,cron.weekly}

 

And moved all unnecessary cron job scripts like:

1. nmon (old school network / memory / hard disk console tool for monitoring and tuning server parameters)
2. clamscan / freshclam crons
3. mlocate (the script that takes care of periodically running the updatedb command, which keeps the DB of the locate command fresh, so that
finding a file puts fewer read operations on the disk, e.g. saves you from running every time
a cmd like: find / -iname '*whatever_you_look_for*')
4. cups cron jobs
5. logwatch cron
6. rkhunter stuff
7. logrotate (yes, we stopped even the log rotation trigger job, as we found the server was sometimes crashing at the same time when
the logrotate job that rotates the logs inside /var/log/* was running, perhaps hitting an I/O read error (bad blocks)).
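
Moving a cron script aside while keeping its original location, so that it can be restored later, could look roughly like this. It is a sketch only: the function name park_cron_job is made up, and the script names in the example are typical ones, so check what your own /etc/cron.* directories hold.

```shell
# park_cron_job SRC_ROOT BACKUP_ROOT RELATIVE_PATH
# Move one cron script out of SRC_ROOT into BACKUP_ROOT, keeping its
# relative location (e.g. etc/cron.daily/...) for an easy restore.
# Hypothetical helper name.
park_cron_job() {
    src_root=$1
    backup_root=$2
    rel=$3
    if [ ! -e "$src_root/$rel" ]; then
        echo "not found: $src_root/$rel" >&2
        return 1
    fi
    mkdir -p "$backup_root/$(dirname "$rel")"
    mv "$src_root/$rel" "$backup_root/$rel"
}

# Example, using the /root/etc_old directory created above:
# park_cron_job / /root/etc_old etc/cron.daily/mlocate.cron
# park_cron_job / /root/etc_old etc/cron.daily/logrotate
```

Restoring a job is then just the same mv in the opposite direction.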


We also inspected the cron jobs of the Administrator user root for any unwanted scripts, and stopped two reporting bash scripts that were part of the PCI tightened Security procedures.
There we found a script responsible for periodically reporting the list of installed packages and whether they have changed, as well as a script periodically reporting via email the list of
users created in /etc/{passwd,shadow}, historically used to keep an eye on the list of users and easily see if someone
has created new users on the machine. Those were enabled via the /var/spool/cron/root cron jobs. On other machines, if this happens to you,
it is a good idea to check out all the existing user cron jobs and stop anything that might be putting extra Read / Write pressure on the Hard drives attached to the machine.

# ls -al /var/spool/cron/
total 20
drwx------  2 root root 4096 Nov 13  2015 .
drwxr-xr-x 12 root root 4096 May 11  2011 ..
-rw-------  1 root root  133 Nov 13  2015 root
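
To see at a glance what every user crontab actually runs, the spool directory can be walked with a few lines of shell. A minimal sketch, assuming the RHEL style spool location /var/spool/cron shown above (Debian based systems keep the files in /var/spool/cron/crontabs); the function name show_user_crontabs is illustrative:

```shell
# show_user_crontabs SPOOL_DIR
# Print every active line (not empty, not a comment) of each per-user
# crontab file found in SPOOL_DIR, prefixed with the owning user name.
# Hypothetical helper name.
show_user_crontabs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        user=$(basename "$f")
        awk -v u="$user" '!/^[ \t]*(#|$)/ { print u ": " $0 }' "$f"
    done
}

# Example (needs root to read the spool files):
# show_user_crontabs /var/spool/cron
```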


3. Clear up old log files and any unnecessary files

Under /var/log and /home /var/tmp /var/spool/tmp immediately try to clear up the old log files.
From my past experience this has many times freed up FS inodes that are stored on an unbroken part (good blocks) of the hard drive and
are then ready to be reused by the files newly written by the rsyslog / syslogd services.

!!! Note that during the removal of some files you might hit files stored on bad blocks, which might lead to an unexpected system reboot.

But that's okay, don't worry, most likely after a hard reset by a technician in the Datacenter the machine will boot again and you can enjoy
removing the remaining files, sending them to the heaven for old files.
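
One cautious way to do the cleanup is to list the candidates first and delete only after reviewing the list. The sketch below assumes a find with -mtime support; the function names and the 30 day threshold are arbitrary illustrations, not values from the original procedure.

```shell
# list_old_logs DIR DAYS
# Print regular files under DIR (same filesystem only) that were last
# modified more than DAYS days ago. Hypothetical helper name.
list_old_logs() {
    find "$1" -xdev -type f -mtime +"$2" -print
}

# purge_old_logs DIR DAYS
# Delete the same set of files. Run list_old_logs first and review it.
purge_old_logs() {
    find "$1" -xdev -type f -mtime +"$2" -exec rm -f -- {} +
}

# Example:
# list_old_logs /var/log 30
# purge_old_logs /var/log 30
```

Splitting the listing from the deletion also shows which file was being touched if the machine reboots in the middle of the run.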

 

4. Trigger an automatic system file system check with fsck on next boot

The standard way to force a Linux to automatically recheck its root filesystem is to simply create the /forcefsck file on the root partition, or on any other secondary disk partition you would like to check.

# touch /forcefsck

# reboot


However, on some occasions you might be unable to do it, because the / (root fs) has been remounted in Read-Only mode, yikes …
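
Whether / really got remounted read-only can be confirmed from the mount table before deciding which way to go. A minimal sketch, assuming a Linux /proc/mounts; the function name root_is_readonly is made up for illustration:

```shell
# root_is_readonly [MOUNTS_FILE]
# Return 0 if the last entry for "/" in MOUNTS_FILE (default: /proc/mounts)
# carries the "ro" mount option, 1 otherwise. Hypothetical helper name.
root_is_readonly() {
    awk '$2 == "/" { opts = $4 }
         END { exit ((opts ~ /(^|,)ro(,|$)/) ? 0 : 1) }' "${1:-/proc/mounts}"
}

# Example:
# if root_is_readonly; then echo "/ is mounted read-only"; fi
```

The last matching entry is used because older kernels list / twice (a rootfs line followed by the real device), and the later line reflects the current state.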

Luckily, old Linux distributions like this RHEL 5.11 have a way to force a filesystem check with fsck after reboot, which identifies any
unknown bad blocks and hopefully succeeds in isolating them, so you don't hit the same auto-reboots, provided the hard drive or Software / Hardware RAID
is not in a terrible state:
you can use an option built into the /sbin/shutdown command, the '-F'

   -F     Force fsck on reboot.


Hence to make the machine reboot and trigger immediately fsck:

# shutdown -rF now


Just in case you wonder why you should reboot before checking the filesystems: simply because they need to be unmounted before you check them.

In that specific case this produced so far a good result and the machine booted just fine and we crossed the fingers and prayed that the machine would work flawlessly in the coming few weeks, before we finalize the configuration of the substitute machines, where this old infrastructure will be migrated to a new built cluster with new Haproxy and Corosync / Pacemaker Cluster on a brand new RHEL.

NB! On newer machines this won't work, however, as the shutdown command no longer honors this option: it was a feature of SystemV (SysVinit) and Upstart, and is not available on the newer systemd service architecture.
 

5. Hints on checking the hard drives with fsck

If you happen to have physical access to the remote Hardware machine via a TTY[1-9] Console, that's even better and is the standard way to do it, but in this specific case we had no easy way to get access to the physical server console.

It is even better to go there and, via either a connected Monitor (Display) or a KVM Switch (for those who hear about a KVM switch for the first time: this is a great device in server rooms for connecting multiple servers to the same Monitor Display), boot one of the multitude of USB Linux recovery distros to choose from, or a CDROM / DVD on older machines like this one, with Redhat's recovery mode rolled on.
With the partitions you want to check left unmounted, simply check each of the disks
e.g. :

# fsck -y /dev/sdb
# fsck -y /dev/sdc

Or, if you don't want to waste time looking for each hard drive, but want to directly check all the ones attached and known to the Linux distro via the /etc/fstab definitions (-A), skipping the root filesystem (-R), run:

# fsck -AR

If necessary, and you have a mixture of filesystems, for example EXT3, EXT4, REISERFS, you can tell it to omit some filesystem type, for example ext3, like that:

# fsck -AR -t noext3 -y


To make fsck skip partitions that are currently mounted:

# fsck -M /dev/sdb


One remark to make here: to complete its job on the various filesystems, fsck usually relies on external helper binaries, usually stored in /sbin/fsck*

# ls -al /sbin/fsck*
-rwxr-xr-x 1 root root  55576 20 Jan 2022 /sbin/fsck*
-rwxr-xr-x 1 root root  43272 20 Jan 2022 /sbin/fsck.cramfs*
lrwxrwxrwx 1 root root      9  4 Jul 2020 /sbin/fsck.exfat -> exfatfsck*
lrwxrwxrwx 1 root root      6  7 Jun 2021 /sbin/fsck.ext2 -> e2fsck*
lrwxrwxrwx 1 root root      6  7 Jun 2021 /sbin/fsck.ext3 -> e2fsck*
lrwxrwxrwx 1 root root      6  7 Jun 2021 /sbin/fsck.ext4 -> e2fsck*
-rwxr-xr-x 1 root root  84208  8 Feb 2021 /sbin/fsck.fat*
-rwxr-xr-x 2 root root 393040 30 Nov 2009 /sbin/fsck.jfs*
-rwxr-xr-x 1 root root 125184 20 Jan 2022 /sbin/fsck.minix*
lrwxrwxrwx 1 root root      8  8 Feb 2021 /sbin/fsck.msdos -> fsck.fat*
-rwxr-xr-x 1 root root    333 16 Dec 2021 /sbin/fsck.nfs*
lrwxrwxrwx 1 root root      8  8 Feb 2021 /sbin/fsck.vfat -> fsck.fat*


6. Using tune2fs to adjust tunable filesystem parameters on ext2/ext3/ext4 filesystems (few examples)

a) To check whether the filesystem was really checked at boot time, or to see when a random filesystem on the server was last checked with fsck:

#  tune2fs -l /dev/sda1 | grep checked
Last checked:             Wed Apr 17 11:04:44 2019

On some distributions like old Debian and Ubuntu, it is even possible to enable fsck to log its operations during check on reboot via changing the verbosity from NO to YES:

# sed -i "s/#VERBOSE=no/VERBOSE=yes/" /etc/default/rcS


If you're having the issues on old Debian Linuxes and not on RHEL, it is possible to:

b) Enable automatic fsck repairs on boot

by running:
 

# sed -i "s/FSCKFIX=no/FSCKFIX=yes/" /etc/default/rcS


c) Forcing an fsck check for the server's attached Hard Drive Partitions with tune2fs

# tune2fs -c 1 /dev/sdXY

Note that:
tune2fs can force a fsck on each reboot for EXT4, EXT3 and EXT2 filesystems only.

tune2fs can trigger a forced fsck on every reboot using the -c (max-mount-counts) option.
This option sets the number of mounts after which the filesystem will be checked, so setting it to 1 will run fsck each time the computer boots.
Setting it to -1 or 0 resets this (the number of times the filesystem is mounted will be disregarded by e2fsck and the kernel).


 For example you could:

d) Set fsck to run a filesystem check every 30 boots, by using -c 30 
 

# tune2fs -c 30 /dev/sdXY


e) Check when was the last time the file system /dev/sdX was checked:
 

# tune2fs -l /dev/sdX | grep Last\ c
Last checked:             Thu Jan 12 20:28:34 2017


f) Check how many times our /dev/sdX filesystem was mounted

# tune2fs -l /dev/sdX | grep Mount
Mount count:              157

g) Check how many mounts are allowed to pass before filesystem check is forced
 

# tune2fs -l /dev/sdX | grep Max
Maximum mount count:      -1
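
The two counters shown above can be combined to tell how far a filesystem is from its next count-based check. This is just a sketch that parses the text printed by tune2fs -l; the function name mounts_until_fsck is made up, and the result only matters if fsck is actually run for that filesystem at boot.

```shell
# mounts_until_fsck
# Read `tune2fs -l` output on stdin and report how many mounts are left
# before the count-based fsck becomes due. Hypothetical helper name.
mounts_until_fsck() {
    awk -F: '
        /^Mount count/         { gsub(/[ \t]/, "", $2); count = $2 + 0 }
        /^Maximum mount count/ { gsub(/[ \t]/, "", $2); max = $2 + 0 }
        END {
            if (max <= 0)
                print "count-based check disabled"
            else if (count >= max)
                print "check due on next boot"
            else
                printf "%d mounts left\n", max - count
        }'
}

# Example:
# tune2fs -l /dev/sdX | mounts_until_fsck
```

With the sample values from above (Mount count 157, Maximum mount count -1) the function reports that the count-based check is disabled.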


7. Repairing disk / partitions via GRUB fsck.mode and fsck.repair kernel command-line options

It is also possible to force a fsck.repair on boot via GRUB, but that usually is not an option someone would like, as the machine might fail to boot if the repair goes badly; however, in difficult situations with failing disks, temporarily enabling it is a good idea.

This can be done by including the options in the GRUB default config (/etc/default/grub) and regenerating the GRUB configuration afterwards:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash fsck.mode=force fsck.repair=yes"

fsck.mode=force – will force a fsck each time the system boots; keeping that value enabled for a long time inside GRUB is a bad idea for servers, as sometimes booting could be severely prolonged because of the checks, especially on servers with many or slow old hard drives.

fsck.repair=yes – will make fsck try to repair any problems it finds when checking (be absolutely sure you know what you're doing if passing this option)

The options can also be set by editing the kernel line in the GRUB boot screen, if you have physical access to the server and don't want to regenerate the grub config and possibly make the machine unbootable on next boot.
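
After the reboot it is worth verifying that the kernel was really booted with those parameters. A minimal sketch, assuming a Linux /proc/cmdline; the function name show_fsck_boot_params is illustrative:

```shell
# show_fsck_boot_params [CMDLINE_FILE]
# Print the fsck.mode= / fsck.repair= parameters found on the kernel
# command line (default: /proc/cmdline), one per line.
# Prints nothing if none were passed. Hypothetical helper name.
show_fsck_boot_params() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -E '^fsck\.(mode|repair)='
}

# Example:
# show_fsck_boot_params
```

An empty output after a reboot means the change did not reach the kernel command line, for example because the GRUB configuration was not regenerated.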
 

8. Few more details on how the /etc/fstab fsck check parameter values work on Systemd Linux machines

The "proper" way on systemd (if we can talk about a proper way on Linux) to run fsck for each filesystem that has a fsck helper is to set a number greater than 0 in
the last column of /etc/fstab, so make sure you edit your /etc/fstab if that's not the case.

The root partition should be set to 1 (first to be checked), while other partitions you want to be checked should be set to 2.

Example /etc/fstab:
 

# /etc/fstab: static file system information.

/dev/sda1  /      ext4  errors=remount-ro  0  1
/dev/sda5  /home  ext4  defaults           0  2

The meaning of the values you can put as the second number (the last column) is as follows:
0 – disabled, that is do not check filesystem
1 – partition with this PASS value has a higher priority and is checked first. This value is usually set to the root / partition
2 – partitions with this PASS value will be checked last
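
To review which filesystems are configured for a boot-time check and in which pass, the last column can be pulled out of /etc/fstab with a one-liner. A sketch under the assumption of a standard six-column fstab; the function name fsck_order is made up:

```shell
# fsck_order [FSTAB]
# List the filesystems that have a non-zero PASS value (6th column) in
# FSTAB (default: /etc/fstab), lowest pass number first, in the form:
#   PASS DEVICE MOUNTPOINT
# Hypothetical helper name.
fsck_order() {
    awk '!/^[ \t]*#/ && NF >= 6 && $6 > 0 { print $6, $1, $2 }' \
        "${1:-/etc/fstab}" | sort -n
}

# Example:
# fsck_order
```

Entries with only four or five columns are left out by the sketch, which matches the fstab convention that a missing PASS field counts as 0.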

a) Check the produced log out of fsck

Unfortunately, on the older versions of Linux distros with SystemV, the fsck log output might not be generated anywhere except on the physical console, so if you have some kind of device duplicating the physical tty on the display port of the server, you might capture some bad block reports or fixed error messages; if you don't, you can just cross your fingers and hope that any FS irregularities found were recovered.

On systemd Linux machines the fsck log should be produced either in /run/initramfs/fsck.log or in some other location depending on the Linux distro, and you should be able to see something from fsck inside the /var/log/* logs:

# grep -rli fsck /var/log/*


Close it up

Having a system with a failing disk is really one of the worst sysadmin nightmares to get. The good news is that in most cases we're prepared with some working backup or some workaround, like the few steps explained above to reduce the amount of Reads / Writes to the hard disks of the failing machine.

If the failing disk holds a primary Linux filesystem, all becomes even worse, as on every next reboot you have no guarantee that the kernel / initrd or some of the other system components required to run the core Linux system won't break the normal boot. Thus, on one side, changes on the hard drives are a risky business; on the other side, if you have a mirror system, or the failing system is a Linux server installed with a Cluster pair, this is not a big deal, as you can guarantee at least one of the nodes is still up, running and serving.

Still, doing too many operations on the HDD is always a danger, so even though the steps described in most cases lead to an improvement in how the system behaves, the system should be considered totally unreliable and be closely monitored, not only by some monitoring stuff like Zabbix / Prometheus or whatever, but also by regularly checking the system's state via normal SSH logins.

If you have some important data or logs on the system that are not synchronized to another node, it is important to copy them before doing any of the described operations. After at least the minimum is backed up, proceed to clear up everything that can be cleared up while the machine still continues to provide most of its functionality, trigger an automatic fsck HDD check on next reboot, reboot, check what is going on and monitor the machine from there on.

Hopefully the few described steps have helped some sysadmin. There are plenty of things which I've described that might go wrong, and even following the described steps might not help if the machine's Storage Drives / SAS / SSD have too much damage. But, as said, in most cases following these few steps would improve the machine state.

Wish you the best of luck!