Posts Tagged ‘downtime’

How to RPM update Hypervisors and Virtual Machines running Haproxy High Availability cluster on KVM, Virtuozzo without a downtime on RHEL / CentOS Linux

Friday, May 20th, 2022

virtuozzo-kvm-virtual-machines-and-hypervisor-update-manual-haproxy-logo


Here is the scenario, lets say you have on your daily task list two Hypervisor (HV) hosts running CentOS or RHEL Linux with KVM or Virutozzo technology and inside the HV hosts you have configured at least 2 pairs of virtual machines one residing on HV Host 1 and one residing on HV Host 2 and you need to constantly keep the hosts to the latest distribution major release security patchset.

The Virtual Machines has been running another set of Redhat Linux or CentOS configured to work in a High Availability Cluster running Haproxy / Apache / Postfix or any other kind of HA solution on top of corosync / keepalived or whatever application cluster scripts Free or Open Source technology that supports a switch between clustered Application nodes.

The logical question comes how to keep up the CentOS / RHEL Machines uptodate without interfering with the operations of the Applications running on the cluster?

Assuming that the 2 or more machines are configured to run in Active / Passive App member mode, e.g. one machine is Active at any time and the other is always Passive, a switch is possible between the Active and Passive node.

HAProxy--Load-Balancer-cluster-2-nodes-your-Servers

In this article I'll give a simple step by step tested example on how you I succeeded to update (for security reasons) up to the latest available Distribution major release patchset on one by one first the Clustered App on Virtual Machines 1 and VM2 on Linux Hypervisor Host 1. Then the App cluster VM 1 / VM 2 on Hypervisor Host 2.
And finally update the Hypervisor1 (after moving the Active resources from it to Hypervisor2) and updating the Hypervisor2 after moving the App running resources back on HV1.
I know the procedure is a bit monotonic but it tries to go through everything step by step to try to mitigate any possible problems. In case of failure of some rpm dependencies during yum / dnf tool updates you can always revert to backups so in anyways don't forget to have a fully functional backup of each of the HV hosts and the VMs somewhere on a separate machine before proceeding further, any possible failures due to following my aritcle literally is your responsibility 🙂

 

0. Check situation before the update on HVs / get VM IDs etc.

Check the virsion of each of the machines to be updated both Hypervisor and Hosted VMs, on each machine run:
 

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)


The machine setup I'll be dealing with is as follows:
 

hypervisor-host1 -> hypervisor-host1.fqdn.com 
•    virt-mach-centos1
•    virt-machine-zabbix-proxy-centos (zabbix proxy)

hypervisor-host2 -> hypervisor-host2.fqdn.com
•    virt-mach-centos2
•    virt-machine-zabbix2-proxy-centos (zabbix proxy)

To check what is yours check out with virsh cmd –if on KVM or with prlctl if using Virutozzo, you should get something like:

[root@hypervisor-host2 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host1 running
 2 virt-mach-centos2 running

 # virsh list –all

[root@hypervisor-host1 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host2 running
 3 virt-mach-centos1 running

[root@hypervisor-host1 ~]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{dc37c201-08c9-589d-aa20-9386d63ce3f3}  running      –               VM virt-mach-centos1
{76e8a5f8-caa8-5442-830e-aa4bfe8d42d9}  running      –               VM vm-host2
[root@hypervisor-host1 ~]#

If you have stopped VMs with Virtuozzo to list the stopped ones as well.
 

# prlctl list -a

[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# vzlist -a
                                CTID      NPROC STATUS    IP_ADDR         HOSTNAME
[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{92075803-a4ce-5ec0-a3d8-9ee83d85fc76}  running      –               VM virt-mach-centos2
{74a7bbe8-9245-5385-ac0d-d10299100789}  running      –               VM vm-host1

# prlctl list -a


If due to Virtuozzo version above command does not return you can manually check in the VM located folder, VM ID etc.
 

[root@hypervisor-host2 vmprivate]# ls
74a7bbe8-9245-4385-ac0d-d10299100789  92075803-a4ce-4ec0-a3d8-9ee83d85fc76
[root@hypervisor-host2 vmprivate]# pwd
/vz/vmprivate
[root@hypervisor-host2 vmprivate]#


[root@hypervisor-host1 ~]# ls -al /vz/vmprivate/
total 20
drwxr-x—. 5 root root 4096 Feb 14  2019 .
drwxr-xr-x. 7 root root 4096 Feb 13  2019 ..
drwxr-x–x. 4 root root 4096 Feb 18  2019 1c863dfc-1deb-493c-820f-3005a0457627
drwxr-x–x. 4 root root 4096 Feb 14  2019 76e8a5f8-caa8-4442-830e-aa4bfe8d42d9
drwxr-x–x. 4 root root 4096 Feb 14  2019 dc37c201-08c9-489d-aa20-9386d63ce3f3
[root@hypervisor-host1 ~]#


Before doing anything with the VMs, also don't forget to check the Hypervisor hosts has enough space, otherwise you'll get in big troubles !
 

[root@hypervisor-host2 vmprivate]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/centos_hypervisor-host2-root   20G  1.8G   17G  10% /
devtmpfs                          20G     0   20G   0% /dev
tmpfs                             20G     0   20G   0% /dev/shm
tmpfs                             20G  2.0G   18G  11% /run
tmpfs                             20G     0   20G   0% /sys/fs/cgroup
/dev/sda1                        992M  159M  766M  18% /boot
/dev/mapper/centos_hypervisor-host2-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos_hypervisor-host2-var   9.8G  355M  8.9G   4% /var
/dev/mapper/centos_hypervisor-host2-vz    755G   25G  692G   4% /vz

 

[root@hypervisor-host1 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   50G  1.8G   45G   4% /
devtmpfs                  20G     0   20G   0% /dev
tmpfs                     20G     0   20G   0% /dev/shm
tmpfs                     20G  2.1G   18G  11% /run
tmpfs                     20G     0   20G   0% /sys/fs/cgroup
/dev/sda2                992M  153M  772M  17% /boot
/dev/mapper/centos-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos-var   9.8G  406M  8.9G   5% /var
/dev/mapper/centos-vz    689G   12G  643G   2% /vz

Another thing to do before proceeding with update is to check and tune if needed the amount of CentOS repositories used, before doing anything with yum.
 

[root@hypervisor-host2 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 110 root root 12288 Oct  7 11:13 ..
-rw-r–r–.   1 root root  4382 Mar 14  2019 CentOS7.repo
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo
[root@hypervisor-host2 yum.repos.d]#

 

[root@hypervisor-host1 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 112 root root 12288 Oct  7 11:09 ..
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   300 Mar 14  2019 obsoleted_tmpls.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo


1. Dump VM definition XMs (to have it in case if it gets wiped during update)

There is always a possibility that something will fail during the update and you might be unable to restore back to the old version of the Virtual Machine due to some config misconfigurations or whatever thus a very good idea, before proceeding to modify the working VMs is to use KVM's virsh and dump the exact set of XML configuration that makes the VM roll properly.

To do so:
Check a little bit up in the article how we have listed the IDs that are part of the directory containing the VM.
 

[root@hypervisor-host1 ]# virsh dumpxml (Id of VM virt-mach-centos1 ) > /root/virt-mach-centos1_config_bak.xml
[root@hypervisor-host2 ]# virsh dumpxml (Id of VM virt-mach-centos2) > /root/virt-mach-centos2_config_bak.xml

 


2. Set on standby virt-mach-centos1 (virt-mach-centos1)

As I'm upgrading two machines that are configured to run an haproxy corosync cluster, before proceeding to update the active host, we have to switch off
the proxied traffic from node1 to node2, – e.g. standby the active node, so the cluster can move up the traffic to other available node.
 

[root@virt-mach-centos1 ~]# pcs cluster standby virt-mach-centos1


3. Stop VM virt-mach-centos1 & backup on Hypervisor host (hypervisor-host1) for VM1

Another prevention step to make sure you don't get into damaged VM or broken haproxy cluster after the upgrade is to of course backup 

 

[root@hypervisor-host1 ]# prlctl backup virt-mach-centos1

or
 

[root@hypervisor-host1 ]# prlctl stop virt-mach-centos1
[root@hypervisor-host1 ]# cp -rpf /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3 /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3-bak
[root@hypervisor-host1 ]# tar -czvf virt-mach-centos1_vm_virt-mach-centos1.tar.gz /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3

[root@hypervisor-host1 ]# prlctl start virt-mach-centos1


4. Remove package version locks on all hosts

If you're using package locking to prevent some other colleague to not accidently upgrade the machine (if multiple sysadmins are managing the host), you might use the RPM package locking meachanism, if that is used check RPM packs that are locked and release the locking.

+ List actual list of locked packages

[root@hypervisor-host1 ]# yum versionlock list  

…..
0:libtalloc-2.1.16-1.el7.*
0:libedit-3.0-12.20121213cvs.el7.*
0:p11-kit-trust-0.23.5-3.el7.*
1:quota-nls-4.01-19.el7.*
0:perl-Exporter-5.68-3.el7.*
0:sudo-1.8.23-9.el7.*
0:libxslt-1.1.28-5.el7.*
versionlock list done
                          

+ Clear the locking            

# yum versionlock clear                               


+ List actual list / == clear all entries
 

[root@virt-mach-centos2 ]# yum versionlock list; yum versionlock clear
[root@virt-mach-centos1 ]# yum versionlock list; yum versionlock clear
[root@hypervisor-host1 ~]# yum versionlock list; yum versionlock clear
[root@hypervisor-host2 ~]# yum versionlock list; yum versionlock clear


5. Do yum update virt-mach-centos1


For some clarity if something goes wrong, it is really a good idea to make a dump of the basic packages installed before the RPM package update is initiated,
The exact versoin of RHEL or CentOS as well as the list of locked packages, if locking is used.

Enter virt-mach-centos1 (ssh virt-mach-centos1) and run following cmds:
 

# cat /etc/redhat-release  > /root/logs/redhat-release-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


+ Only if needed!!
 

# yum versionlock clear
# yum versionlock list


Clear any previous RPM packages – careful with that as you might want to keep the old RPMs, if unsure comment out below line
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

Proceed with the update and monitor closely the output of commands and log out everything inside files using a small script that you should place under /root/status the script is given at the end of the aritcle.:
 

yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
yum check-update | wc -l
yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

6. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


7. Stop VM virt-mach-centos2 & backup  on Hypervisor host (hypervisor-host2)

Same backup step as prior 

# prlctl backup virt-mach-centos2


or
 

# prlctl stop virt-mach-centos2
# cp -rpf /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76 /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76-bak
## tar -czvf virt-mach-centos2_vm_virt-mach-centos2.tar.gz /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76

# prctl start virt-mach-centos2


8. Do yum update on virt-mach-centos2

Log system state, before the update
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list

 

Clean old install update / packages if required
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Initiate the update

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l 
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


9. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now

 

10. Stop VM vm-host2 & backup
 

# prlctl backup vm-host2


or

# prlctl stop vm-host2

Or copy the actual directory containig the Virtozzo VM (use the correct ID)
 

# cp -rpf /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9 /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9-bak
## tar -czvf vm-host2.tar.gz /vz/vmprivate/76e8a5f8-caa8-4442-830e-aa5bfe8d42d9

# prctl start vm-host2


11. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Clear only if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Do the rpm upgrade

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


12. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now


13. Do yum update hypervisor-host2

 

 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock   if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Update rpms
 

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


14. Stop VM vm-host1 & backup


Some as ealier
 

# prlctl backup vm-host1

or
 

# prlctl stop vm-host1

# cp -rpf /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789 /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789-bak
# tar -czvf vm-host1.tar.gz /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789

# prctl start vm-host1


15. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


16. Check if everything is running fine after upgrade

+ Reboot VM

# shutdown -r now


17. Do yum update hypervisor-host1

Same procedure for HV host 1 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock
 

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


18. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host1 all VMs run as expected 


19. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host2 all VMs run as expected afterwards


20. Check once more VMs and haproxy or any other contained services in VMs run as expected

Login to hosts and check processes and logs for errors etc.
 

21. Haproxy Unstandby virt-mach-centos1

Assuming that the virt-mach-centos1 and virt-mach-centos2 are running a Haproxy / corosync cluster you can try to standby node1 and check the result
hopefully all should be fine and traffic should come to host node2.

[root@virt-mach-centos1 ~]# pcs cluster unstandby virt-mach-centos1


Monitor logs and make sure HAproxy works fine on virt-mach-centos1


22. If necessery to redefine VMs (in case they disappear from virsh) or virtuosso is not working

[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos1_config_bak.xml
[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos2_config_bak.xml


23. Set versionlock to RPMs to prevent accident updates and check OS version release

[root@virt-mach-centos2 ]# yum versionlock \*
[root@virt-mach-centos1 ]# yum versionlock \*
[root@hypervisor-host1 ~]# yum versionlock \*
[root@hypervisor-host2 ~]# yum versionlock \*

[root@hypervisor-host2 ~]# cat /etc/redhat-release 
CentOS Linux release 7.8.2003 (Core)

Other useful hints

[root@hypervisor-host1 ~]# virsh console dc37c201-08c9-489d-aa20-9386d63ce3f3
Connected to domain virt-mach-centos1
..

! Compare packages count before the upgrade on each of the supposable identical VMs and HVs – if there is difference in package count review what kind of packages are different and try to make the machines to look as identical as possible  !

Packages to update on hypervisor-host1 Count: XXX
Packages to update on hypervisor-host2 Count: XXX
Packages to update virt-mach-centos1 Count: – 254
Packages to update virt-mach-centos2 Count: – 249

The /root/status script

+++

#!/bin/sh
echo  '=======================================================   '
echo  '= Systemctl list-unit-files –type=service | grep enabled '
echo  '=======================================================   '
systemctl list-unit-files –type=service | grep enabled

echo  '=======================================================   '
echo  '= systemctl | grep ".service" | grep "running"            '
echo  '=======================================================   '
systemctl | grep ".service" | grep "running"

echo  '=======================================================   '
echo  '= chkconfig –list                                        '
echo  '=======================================================   '
chkconfig –list

echo  '=======================================================   '
echo  '= netstat -tulpn                                          '
echo  '=======================================================   '
netstat -tulpn

echo  '=======================================================   '
echo  '= netstat -r                                              '
echo  '=======================================================   '
netstat -r


+++

That's all folks, once going through the article, after some 2 hours of efforts or so you should have an up2date machines.
Any problems faced or feedback is mostly welcome as this might help others who have the same setup.

Thanks for reading me 🙂

Automatic network restart and reboot Linux server script if ping timeout to gateway is not responding as a way to reduce connectivity downtimes

Monday, December 10th, 2018

automatic-server-network-restart-and-reboot-script-if-connection-to-server-gateway-inavailable-tux-penguing-ascii-art-bin-bash

Inability of server to come back online server automaticallyafter electricity / network outage

These days my home server  is experiencing a lot of issues due to Electricity Power Outages, a construction dig operations to fix / change waterpipe tubes near my home are in action and perhaps the power cables got ruptered by the digger machine.
The effect of all this was that my server networking accessability was affected and as I didn't have network I couldn't access it remotely anymore at a certain point the electricity was restored (and the UPS charge could keep the server up), however the server accessibility did not due restore until I asked a relative to restart it or under a more complicated cases where Tech aquanted guy has to help – Alexander (Alex) a close friend from school years check his old site here – alex.www.pc-freak.net helps a lot.to restart the machine physically either run a quick restoration commands on root TTY terminal or generally do check whether default router is reachable.

This kind of Pc-Freak.net downtime issues over the last month become too frequent (the machine was down about 5 times for 2 to 5 hours and this was too much (and weirdly enough it was not accessible from the internet even after electricity network was restored and the only solution to that was a physical server restart (from the Power Button).

To decrease the number of cases in which known relatives or friends has to  physically go to the server and restart it, each time after network or electricity outage I wrote a small script to check accessibility towards Default defined Network Gateway for my server with few ICMP packages sent with good old PING command
and trigger a network restart and system reboot
(in case if the network restart does fail) in a row.

1. Create reboot-if-nwork-is-downsh script under /usr/sbin or other dir

Here is the script itself:

 

#!/bin/sh
# Script checks with ping 5 ICMP pings 10 times to DEF GW and if so
# triggers networking restart /etc/inid.d/networking restart
# Then does another 5 x 10 PINGS and if ping command returns errors,
# Reboots machine
# This script is useful if you run home router with Linux and you have
# electricity outages and machine doesn't go up if not rebooted in that case

GATEWAY_HOST='192.168.0.1';

run_ping () {
for i in $(seq 1 10); do
    ping -c 5 $GATEWAY_HOST
done

}

reboot_f () {
if [ $? -eq 0 ]; then
        echo "$(date "+%Y-%m-%d %H:%M:%S") Ping to $GATEWAY_HOST OK" >> /var/log/reboot.log
    else
    /etc/init.d/networking restart
        echo "$(date "+%Y-%m-%d %H:%M:%S") Restarted Network Interfaces:" >> /tmp/rebooted.txt
    for i in $(seq 1 10); do ping -c 5 $GATEWAY_HOST; done
    if [ $? -eq 0 ] && [ $(cat /tmp/rebooted.txt) -lt ‘5’ ]; then
         echo "$(date "+%Y-%m-%d %H:%M:%S") Ping to $GATEWAY_HOST FAILED !!! REBOOTING." >> /var/log/reboot.log
        /sbin/reboot

    # increment 5 times until stop
    [[ -f /tmp/rebooted.txt ]] || echo 0 > /tmp/rebooted.txt
    n=$(< /tmp/rebooted.txt)
        echo $(( n + 1 )) > /tmp/rebooted.txt
    fi
    # if 5 times rebooted sleep 30 mins and reset counter
    if [ $(cat /tmprebooted.txt) -eq ‘5’ ]; then
    sleep 1800
        cat /dev/null > /tmp/rebooted.txt
    fi
fi

}
run_ping;
reboot_f;

You can download a copy of reboot-if-nwork-is-down.sh script here.

As you see in script successful runs  as well as its failures are logged on server in /var/log/reboot.log with respective timestamp.
Also a counter to 5 is kept in /tmp/rebooted.txt, incremented on each and every script run (rebooting) if, the 5 times increment is matched

a sleep is executed for 30 minutes and the counter is being restarted.
The counter check to 5 guarantees the server will not get restarted if access to Gateway is not continuing for a long time to prevent the system is not being restarted like crazy all time.
 

2. Create a cron job to run reboot-if-nwork-is-down.sh every 15 minutes or so 

I've set the script to re-run in a scheduled (root user) cron job every 15 minutes with following  job:

To add the script to the existing cron rules without rewriting my old cron jobs and without tempering to use cronta -u root -e (e.g. do the cron job add in a non-interactive mode with a single bash script one liner had to run following command:

 

{ crontab -l; echo "*/15 * * * * /usr/sbin/reboot-if-nwork-is-down.sh 2>&1 >/dev/null; } | crontab –


I know restarting a server to restore accessibility is a stupid practice but for home-use or small client servers with unguaranteed networks with a cheap Uninterruptable Power Supply (UPS) devices it is useful.

Summary

Time will show how efficient such a  "self-healing script practice is.
Even though I'm pretty sure that even in a Corporate businesses and large Public / Private Hybrid Clouds where access to remote mounted NFS / XFS / ZFS filesystems are failing a modifications of the script could save you a lot of nerves and troubles and unhappy customers / managers screaming at you on the phone 🙂


I'll be interested to hear from others who have a better  ideas to restore ( resurrect ) access to inessible Linux server after an outage.?
 

Pc-Freak 2 days Downtime / Debian Linux Squeeze 32 bit i386 to amd64 hell / Expression of my great Thanks to Alex and my Sister

Tuesday, October 16th, 2012

Debian upgrade Squeeze Linux from 32 to 64 problems, don't try do it except you have physical access !!!

Recently for some UNKNOWN to ME reasons New Pc-Freak computer hardware crashed 2 times over last 2 weeks time, this was completely unexpected especially after the huge hardware upgrade of the system. Currently the system is equipped with 8GB of memory a a nice Dual Core Intel CPU running on CPU speed of 6 GHZ, however for completely unknown to me reasons it continued experience outages and mysteriously hang ups ….

So far I didn’t have the time to put some few documentary pictures of PC hardware on which this blog and the the rest of sites and shell access is running so I will use this post to do this as well:

Below I include a picture for sake of History preservation 🙂 of Old Pc-Freak hardware running on IBM ThinkCentre (1GB Memory, 3Ghz Intel CPU and 80 GB HDD):

IBM Desktop ThinkCentre old pc-freak hardware server PC

The old FreeBSD powered Pc-Freak IBM ThinkCentre

Here are 2 photos of new hardware host running on Lenovo ThinCentre Edge:

New Pc-Freak host hardware lenovo ThinkEdge Photo
New Pc-Freak host hardware Lenovo ThinkEdge Camera Photo
My guess was those unsual “freezes” were caused due to momentum overloads of WebServer or MySQL db.
Actually the Linux Squeeze installed was “stupidly” installed with a 32 bit Debian Linux (by me). I did that stupidity, just few weeks ago, when I moved every data content (SQL, Apache config, Qmail accounts, Shell accounts etc. etc.) from old Pc-Freak computer to the new purchased one.

After finding out I have improperly installed (being in a hurry) – 32 Bit system, I’ve Upgrade only the system 32 bit kernel hich doesn’t support well more than 4GB to an amd64 one supporting up to 64GB of memory – if interested I’ve prior blogged on this here.
Thanks to my dear friend Alexander (who in this case should have a title similar to Alexander the Great – for he did great and not let me down being there in such a difficult moment for me spending from his personal time helping me bringing up Pc-Freak.Net. To find a bit more about Alex you might check his personal home page hosted on www.pc-freak.net too here 🙂
I don’t exaggerate, really Alex did a lot for me and this is maybe the 10th time I disturb him over the last 2 years, so I owe him a lot ! Alex – I really owe you a lot bro – thanks for your great efforts; thanks for going home 3 times for just to days, thanks for recording Rescue CDs, staying at home until 2 A.M. and really thanks for all!!

Just to mention again, to let me via Secure Shell, Alex burned and booted for me Debian Linux Rescue Live CD downloaded from linke here.

This time I messed my tiny little home hosted server, very very badly!!! Those of you who might read my blog or have SSH accounts on Pc-Freak.NET, already should have figured out Pc-Freak.net was down for about 2 days time (48 HOURS!!!!).

The exact “official” downtime period was:

Saturday OCTOBER 13!!!( from around 16:00 o’clock – I’m not fatalist but this 13th was really a harsh date) until Monday 15-th of Oct (14:00h) ….

I’m completely in charge and responsible for the 2 days down time, and honestly I had one of my worst life days, so far. The whole SHIT story occurred after I attempted to do a 32 bit (i386) to AMD64 (64 bit) system packages deb binary upgrade; host is installed to run Debian Squeeze 6.0.5 ….; Note to make here is Officially according to documentation package binary upgrades from 32 bit to 64 arch Debian Linux are not possible!. Official debian.org documentation recommended for 32 bit to 64 packs update (back up all system existent data) and do a clean CD install / re-install, over the old installed 32 bit version. However ignoring the official documentation, being unwise and stubborn, I decided to try to anyways upgrading using those Dutch person guide … !!!

I’ve literally followed above Dutch guy, steps and instead of succeeding 64 bit update, after few of the steps outlined in his article the node completely (libc – library to which all libraries are linked) broke up. Then trying to fix those amd64 libc, I tried re-installing coreutils package part of base-files – basis libs and bins deb;
I’ve followed few tutorials (found on the next instructing on the 32bit to 64 bit upgrade), combined chunks from them, reloaded libc in a live system !!! (DON’T TRY THAT EVER!); then by mistake during update deleted coreutils package!!!, leaving myself without even essential command tools like /bin/ls , /bin/cp etc. etc. ….. And finally very much (in my fashion) to make the mess complete I decided to restart the system in those state without /bin/ls and all essential /bins ….
Instead of making things better I made the system completely un-bootable 🙁

Well to conclude it, here I am once again I stupid enough not to follow the System Administrator Golden Rule of Thumb:

IF SOMETHING WORKS DON’T TOUCH IT !!!!!!!!! EVER !!!!, cause of my stubbornness I screw it up all so badly.
I should really take some moral from this event, as similar stories has happened to me long time ago on few Fedora Linux hosts on productive Web servers, and I went through all this upgrades nightmare but apparently learned nothing from it. My personal moral out of the story is I NEVER LEARN FROM MY MISTAKES!!! PFFF …

I haven’t had days like this in which I was totally down, for a very long time, really I fell in severe desperation and even depressed, after un-abling to access in any way Pc-Freak.NET, I even thought it will be un-fixable forever and I will loose all data on the host and this deeply saddened me.
Here is good time to Give thanks to Svetlana (Sveta) (A lovely kind, very beautiful Belarusian lady 🙂 who supported me and Sali and his wife Mimi (Meleha) who encouraged and lived up my hardly bearable tempper when angry or/and sad :)). Lastly I have to thank a lot to Happy (Indian Lady whose whose my dear indian brother Jose met me with in Skype earlier. Happy encouraged me in many times of trouble in Skype, giving me wise advices not to take all so serious and be more confied, also most importantly Happy helped me with her prayers …. Probably many others to which I complained about situation helped with their prayers too – Thanks to to God and to all and let God return them blessing according to their good prayers for me !

Some people who know me well might know Pc-Freak.Net Linux host has very sentimental value for me and even though it doesn’t host too much websites (only 38 sites not so important ones ), still it is very bad to know your “work input” which you worked on in your spare time over the last 3 years (including my BLOG – blogging almost every day for last 3 yrs, the public shell SSH access for my Friends, custom Qmail Mail server / POP3 and IMAP services / SQL data etc. might not be lost forever. Or in more positive better scenario could be down for huge period of time like few months until I go home and fix it physically on phys terminal …

All this downtime mess occurred due to my own inability to estimate properly update risks (obviously showing how bad I’m in risk management …). Whole “down time story”” proofed me only, I have a lot to learn in life and worry less about things ….
It also show me how much of an “idol”, one can make some kind of object of daily works as www.pc-freak.net become to me. Good thing is I at least realize my blog has with time, become like an idol to me as I’m mostly busy with it and in a way too much worrying for it makes me fill up in the gap “worshipping an idol” and each Christian knows pretty well, God tells us: “Do not have other Gods besides me”.

I suppose this whole mess was allowed to happen by God’s Great Mercy to show me how weak my faith is, and how often I put my personal interest on top of real important things. Whole situation teached me, once again I easy fall in spirit and despair; hope it is a lesson given to me I will learn from and next time I will be more solid in critical situation …

Here are some of my thoughts on the downtime, as I felt obliged to express them too;

Whole problem severeness (in my mind), would not be so bad if I only had some kind of physical access to System terminal. However as I’m currently in Arnhem Holland 6500 kilometers away from the Server (hosted in Dobrich, Bulgaria), don’t have access to IPKVM or any kind of web management to act on the physical keyboard input, my only option was to ask Alex go home and tell him act as a pro tech support which though I repeat myself I will say again, he did great.
What made this whole downtime mess even worser in my distorted vision on situation is, fact; I don’t know people who are Linux GURUs who can deal with the situation and fix the host without me being physically there, so this even exaggerated me worrying it even more …

I’m relatively poor person and I couldn’t easily afford to buy a flight ticket back to Bulgaria which in best case as I checked today in WizzAir.com’s website would costs me about 90EUR (at best – just one way flight ticket ) to Sofia and then more 17 euro for bus ticket from Sofia to Dobrich; Meaning whole repair costs would be no less than 250 EUR with prince included train ticket expenses to Eindhoven.);

Therefore obviously traveling back to fix it on physical console was not an option.
Some other options I considered (as adviced by Sveta), was hiring some (pro sysadm to fix the host) – here I should say it is almost impossible to find person in Dobrich who has the Linux knowledge to fix the system; moreover Linux system administrators are so expensive these days. Most pro sysadmins will not bother to fix the host if not being paid hour – fee of at least 40 / 50 EUR. Obviously therefore hiring a professional UNIX system adminsitrator to solve my system issues would have cost approximately equal to travel expenses of myself, if going physically to the computer; spend the same 5 hours fixing it and loose at least 2 or 3 more days in traveling back to Holland …..
Also it is good to mention on the system, I’ve done a lot of custom things, which an external hired person will be hardly possible to deal with, without my further interference and even if I had hired someone to fix it I would have spend at least 50 euro on Phone Bills to explain specifics ….

As I was in the shit, I should thanks in this post also (on first place) to MY DEAR SISTER Stanimira !!! My sis was smart enough to call my dear friend Alexander (Alex), who as always didn’t fail me – for a 3rd time BIG THANKS ALEX !, spending time and having desire to help me at this critical times. I instructed him as a first step to try loading on the unbootable linux, the usual boot-able Debian Squeeze Install LiveCD….
So far so good, but unfortunately with this bootable CD, the problem is Debian Setup (Install) CD does not come equipped with SSHD (SSH Server) by default and hence I can’t just get in via Internet;
I’ve searched through the net if there is a way to make the default Debian Install CD1 (.iso) recovery CD to have openssh-server enabled, but couldn’t find anyone explainig how ?? If there is some way and someone reading this post knows it please drop a comment ….

As some might know Debian Setup CD is running as its basis environment busybox; system tools there provided whether choosing boot the Recovery Console are good mostly for installing or re-installing Debian, but doesn’t include any way to allow one to do remote system recovery over SSH connection.

Further on, have instructed Alex, brought up the Network Interfacse on the system with ifconfig using cmds:


# /sbin/ifconfig MY_IP netmask 255.255.255.240
# /sbin/route add default gw MY_GATEWAY_IP;

BTW, I have previously blogged on how to bring network interfaces with ifconfig here
Though the LAN Interfaces were up after that and I could ping ($ ping www.pc-freak.net) this was of not much use, as I couldn’t log in. Neither somehow can access system in a chroot.
I did thoroughfully explained Alex, how to fix the un-chroot-table badly broken (mounted) system. ….
In order to have accessed the system via SSH, after a bit of research I’ve asked Alex to download and boot from the CD Drive Debian Linux based AMD64 Rescue CD available here ….

Using this much better rescue CD than default Debian Install CD1, thanks God, Alex was able to bring up a working sshd server.

To let me access the rescue CD, Alex changed root pass to a trivial one with usual:


# passwd root
....

Then finally I logged in on host via ssh. Since chroot over the mounted /vev/sda1 in /tmp/aaa was impossible due to a missing working /bin/bash – Here just try imagine how messed up this system was!!!, I asked Alex to copy over the basic system files from the Rescue CD with cp copy command within /tmp/aaa/. The commands I asked him to execute to override some of the old messed up Linux files were:


# cp -rpf /lib/* /tmp/aaa/lib
# cp -rpf /usr/lib/* /tmp/aaa/usr/lib
# cp -rpf /lib32/* /tmp/aaa/lib32
# cp -rpf /bin/* /tmp/aaa/bin
# cp -rpf /usr/lib64/* /tmp/aaa/usr/lib64
# cp -rpf /sbin/* /tmp/aaa/sbin
# cp -rpf /usr/sbin/* /tmp/aaa/usr/sbin

After this at least chroot /tmp/aaa worked!! Thanks God!

I also said Alex to try bootstrap to install a base debian system files inside the broken /tmp/aaa, but this didn’t make things better (so I’m not sure if debootstrap helped or made things worse)??. Exact bootstrap command tried on the host was:


# debootstrap --arch amd64 squeeze /tmp/aaa http://ftp.us.debian.org/debian

This command as explained in Debian Wiki Debootstrap section is supposed to download and override basis Linux system with working base bins and libs.

After I logged in over ssh, I’ve entered chroot-ing and following instructions of 2 of my previous articles:

1. How to do proper chroot and recover broken Ubuntu using mount and chrooting

2. How to mount /proc and /dev and in chroot on Linux – for fail system recovery

Next on, after logging in via ssh I chrooted to mounted system;


# mount /dev/sda1 /mnt/aaa
# chroot /mnt/aaa

Inside chrooted environment, I tried running ssh server, listen on separate port 2208 with command:


# /usr/sbin/sshd -p 2208

sshd did not start up but spitted mer error: PRNG is not seeded, after reading a bit online I’ve found others experiencing PRNG is not seeded err in thread here

The PRNG is not seeded error is caused due to a missing /dev/urandom inside the chroot-ed environment:


# ls -al /dev/urandom
ls: cannot access /dev/urandom: No such file or directory

To solve it, one has to create /dev/urandom with mknod command:


# mknod /dev/urandom c 1 9

….

Something else worthy to mention is very helpful post found on noah.org explaining few basic things on apt, aptitude and dpkg which helped me over the whole severe failed dependency apt-get issues experienced inside chroot.

Inside the chroot, I tried using few usual apt-get cmds to solve the multiple appearing broken packages inter-dependency. I tried:


# apt-get update
....
# apt-get --yes upgrade
# apt-get -f install

Even before that apt, package was broken, so I instructed Alex, to download me one from a web link. By mistake I gave him, a Debian Etch apt version instead of Debian Squeze. So using once again dpkg -i apt* after downloading the latest stable apt deb binaries from debian.org, I had to re-install apt-get…

Besides that Alex, had copied a bunch of libraries, straight copied from my notebook running amd64 Debian Squeeze and has to place all this transferred binaries in /mnt/aaa/{lib,usr/lib} in order to solve missing libraries for proper apt-get operation.

As it seemed slightly impossible fix the broken dependencies with apt-get, I first tried fixing failed inter-dependencies using the other automated dependency solver tool (written in perl language) aptitude. I tried with it solving the situation issuing:


# aptitute update
# aptitude safe-upgrade
# aptitude safe-upgrade --full-resolver

No of the above aptitude command options helped anyhow, so
I’ve decided to try the old but gold approach of combining common logic with a bit of shell scripting 🙂
Here is my customly invented approach 🙂 :

1. Inside the chroot, make a dump of all installed deb packages names in a file
2. Outside the chroot straight ssh-ing again to the Rescucd shell, use RescueCD apt-get to only download all amd64 binaries corresponding to dumped packages names
3. Move all downloaded only apt-get binaries from /var/cache/apt/archives to /mnt/aaa/var/cache/apt/archives
4. Inside chroot, run cd to /var/cache/apt/archives/ and use for bash loop to install each package with dpkg -i

Inside Chroot-ed environment chroot /tmp/aaa, dpkg – to dump list of all installed i386 previous packages on broken system:


# dpkg -l|awk '{ print $2 }' >> /mnt/aaa/root/all_deb_packages_list.txt

Thereon, I delete first 5 lines in beginning of file (2 empty lines) and 3 lines with content:


Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
Err?=(none)/Reinst-required
Name

should be deleted.

Onwards outside of chroot-ed env, I downloaded all deb packages corresponding to previous ones in all_deb_packges.txt:


# mkdir /tmp/apt
# cd /tmp/apt
# for i in $(cat /mnt/aaa/root/all_deb_packages.txt; do \
apt-get --download-only install -yy $i \
....
.....
done

In a while after 30 / 40 minutes all amd64 .deb packages were downloaded in rescuecd /var/cache/apt/archives/.
/var/cache/apt/archives/ in LiveCDs is stored in system memory, thanksfully I have 8 Gigabytes of memory on the host so memory was more than enough to store all packs 😉
Once above loop, completed. I copied all debs to /mnt/aaa/var/cache/apt, i.e.:


# cp -vrpf /var/cache/apt/archives/*.deb /mnt/aaa/var/cache/apt/archives/

Then back in the (chroot-ed broken system), in another ssh session chroot /mnt/aaa, I run another shell loop aim-ing to install each copied deb package (below command should run after chroot-ing):


# cd /var/cache/apt/archives
# for i in *.deb; do \
dpkg -i $i
done

I had on the system installed Qmail server which was previously linked against old 32 bit installed libs, so in my case was also necessery rebuild qmail install as well as ucsp-tcp and ucsp-ssl, after rebooting and booting the finally working amd64 libs system (after reboot and proper boot!):

a) to Re-compile qmail base binaries, had to issue:


# qmailctl stop
# cd /usr/src/qmail
# make clean
# make man
# make setup check

b) to re-compile ucspi-tcp and ucspi-ssl:


# rm -rf /packages/ucspi-ssl-0.70.2/
#mkdir /packages
# chmod 1755 /packages
# cd /tmp
# tar -zxvf /downloads/ucspi-ssl-0.70.2.tar.gz
....
# mv /tmp/host/superscript.com/net/ucspi-ssl-0.70.2/ /packages
# cd /packages/ucspi-ssl-0.70.2/
# rm -rf /tmp/host/
# sed -i 's/local\///' src/conf-tcpbin
# sed -i 's/usr\/local/etc/' src/conf-cadir
# sed -i 's/usr\/local\/ssl\/pem/etc\/ssl/' src/conf-dhfile
# openssl dhparam -check -text -5 1024 -out /etc/ssl/dh1024.pem

Then had to stop temporary daemontools service, through commenting line in /etc/inittab:


# SV:123456:respawn:/usr/bin/svscanboot


# init q

After that remove commented line:


SV:123456:respawn:/usr/bin/svscanboot

and consequentually install ucsp-{tcp,ssl}:


# cd /packages/ucspi-ssl-0.70.2/
# package/compile
# package/rts
# package/install

c) Rebuild Courier-Imap and CourierImapSSL

As I have custom compiled Courier-IMAP and Courier-IMAPSSL it was necessery to rebuild Courier-imaps following steps earlier explained in this article

I have on the system running DjbDNS as local caching server so I had to also re-install djbdns, re-compiling it from source

Finally after restart the system booted OKAY!! Thanks God!!!!!! 🙂
Further on to check the boot-ed system runs 64 bit architecture dpkg should be used
To check if the system architecture is 64 now 64 bit, there is a command dpkg-architecture, as I learned from superuser.com forums thread here


root@pcfreak:~# dpkg-architecture -qDEB_HOST_ARCH
amd64

One more thing, which helped me a lot during the whole system recovery was main Debian deb HTTP repositories ftp.us.debian.org/debian/pool/ , I’ve downloaded apt (amd64 Squeeze) version and few other packages from there.
Hope this article helps someone who end up in 32 to 64 bit debian arch upgrade. Enjoy 🙂