Posts Tagged ‘emergency’

Haproxy Enable / Disable an Application backend server configured for roundrobin in an emergency via a haproxy socket command

Thursday, May 2nd, 2024


Haproxy LB backend BACKEND_ROUNDROBIN is configured for roundrobin with a health check port (check port 33333).
For example, let's say the haproxy server is running with a haproxy_roundrobin.cfg like this one.

Under some circumstances, however, the check port TCP 33333 can be UP while one or more of the Application servers providing the resources to customers
(app-server1, app-server2, app-server3, app-server4) misbehaves. The Load Balancer cannot know this, because the traffic routing decision is made based only on the echo (check) port.

One example scenario when this can happen is if an Application server has an issue with connectivity towards the Database hosts:
(db-host1, db-host2, db-host3, db-host4)

If this happens, 25% of the traffic might still get balanced to the broken Application server. If such a scenario happens during OnCall and is identified as the problem,
the workaround is to temporarily disable the misbehaving App server member from the 4 configured roundrobin pairs in haproxyproduction.cfg:

For example, if the app-server3 App node is identified as failing and 25% of the traffic via the LB is lost, then to resolve it until the broken Application server node is fixed, you will have to temporarily exclude it from the ring of roundrobin backend hosts.

1.  Check the status of haproxy backends

echo "show stat" | socat stdio /var/lib/haproxy/stats

This prints a CSV line for every frontend, backend and backend server, including its current status (UP, DOWN, MAINT and so on).

Another way to do it, which will not cut the sessions to the server immediately but keep them around for some time, is to put the server you want to exclude from the haproxy roundrobin into "maintenance mode".

echo "set server bk_BACKEND_ROUNDROBIN/app-server3 state maint" | socat unix-connect:/var/lib/haproxy/stats stdio

Actually, there is an even better and more advanced way to take a backend server out of a configured roundrobin pair of hosts: the "drain" state. In drain mode the server is removed from load balancing, so it is not offered new connections, while existing and persistent connections are still allowed to complete and health checks keep running; as soon as the App server connectivity is restored and it is taken out of maintenance, it can simply be put back into rotation.

echo "set server bk_BACKEND_ROUNDROBIN/app-server3 state drain" | socat unix-connect:/var/lib/haproxy/stats stdio


  • This sets the server to DRAIN mode: it is removed from load balancing so no new connections are sent to it, while existing (and persistent) connections are allowed to finish.

To get a better idea of what the drain state is, here is an excerpt from the official haproxy documentation on "set server":

Force a server's administrative state to a new state. This can be useful to
disable load balancing and/or any traffic to a server. Setting the state to
"ready" puts the server in normal mode, and the command is the equivalent of
the "enable server" command. Setting the state to "maint" disables any traffic
to the server as well as any health checks. This is the equivalent of the
"disable server" command. Setting the mode to "drain" only removes the server
from load balancing but still allows it to be checked and to accept new
persistent connections. Changes are propagated to tracking servers if any.
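
According to the same documentation, bringing the server back from drain or maint can also be done by setting its administrative state to "ready", and the current state of the pool can be inspected with "show servers state" (available in newer haproxy versions). A small sketch, reusing the bk_BACKEND_ROUNDROBIN/app-server3 names from the examples above:

echo "set server bk_BACKEND_ROUNDROBIN/app-server3 state ready" | socat unix-connect:/var/lib/haproxy/stats stdio
echo "show servers state bk_BACKEND_ROUNDROBIN" | socat unix-connect:/var/lib/haproxy/stats stdio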


2. Disable backend app-server3 from roundrobin

echo "disable server BACKEND_ROUNDROBIN/app-server3" | socat unix-connect:/var/lib/haproxy/stats stdio

# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
stats,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,282917,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,
Frontend_Name,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
Backend_Name,app-server4,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,0,282917,0,,1,4,1,,0,,2,0,,0,L4OK,,12,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,app-server3,0,0,0,0,,0,0,0,,0,,0,0,0,0,MAINT,1,0,1,1,2,2,23,,1,4,2,,0,,2,0,,0,L4OK,,11,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,1,1,0,,0,282917,0,,1,4,0,,0,,1,0,,0,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,

Once it is confirmed by the Application support colleagues that the machine is out of maintenance mode and working properly again, re-enable it:

3. Enable backend app-server3

echo "enable server bk_BACKEND_ROUNDROBIN/app-server3" | socat unix-connect:/var/lib/haproxy/stats stdio

4. Check backend situation again

echo "show stat" | socat stdio /var/lib/haproxy/stats
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
stats,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,282955,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,
Frontend_Name,FRONTEND,,,0,0,3000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
Backend_Name,app-server4,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,0,282955,0,,1,4,1,,0,,2,0,,0,L4OK,,12,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,app-server3,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,1,2,3,58,,1,4,2,,0,,2,0,,0,L4OK,,11,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
Backend_Name,BACKEND,0,0,0,0,300,0,0,0,0,0,,0,0,0,0,UP,1,1,1,,0,282955,0,,1,4,0,,0,,1,0,,0,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,


You should see the backend enabled again.

NOTE:
If you happen to get "permission denied" errors when you try to send haproxy commands via the configured stats socket, this might be because the socket was enabled in read-only mode. If that is so, haproxy cannot be written to through the socket, and you can only read information from it with status commands but not send any write operations to haproxy via the unix socket.

An example haproxy configuration that enables the haproxy socket in read-only mode looks like this in haproxy.cfg:

 stats socket /var/lib/haproxy/stats


To make the haproxy socket read / write for the root superuser and other users belonging to the admin group 'adm', you should set it in haproxy.cfg to something like:

stats socket /var/lib/haproxy/stats-qa mode 0660 group adm level admin

or, if no special users within an admin group need to have access to the socket, use a config like:

stats socket /var/lib/haproxy/stats-qa.sock mode 0600 level admin
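
After changing the stats socket line, the haproxy configuration has to be re-checked and the service reloaded for the new socket permissions to take effect. A minimal sketch, assuming the config lives in /etc/haproxy/haproxy.cfg and haproxy runs as a systemd service:

haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy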

How to solve “Incorrect key file for table ‘/tmp/#sql_9315.MYI’; try to repair it” mysql start up error

Saturday, April 28th, 2012

When a server's hard disk space gets filled, it is common that Apache returns empty (no content) pages…
This just happened on one server I administer. To restore normal server operation I freed some space by deleting old obsolete backups.
Actually, the whole reason for this mess was an enormous set of backup files, which on the last monthly backup run overfilled the remaining disk space.

Though I freed about 400GB of space on the root filesystem and at first glimpse the system had plenty of free hard drive space, the MySQL server still refused to start up properly on restart and spat the error:

Incorrect key file for table '/tmp/#sql_9315.MYI'; try to repair it" mysql start up error

Besides that, there were corrupted (crashed) tables, which were reported next to the above error.
Checking for /tmp/#sql_9315.MYI, I couldn't see any MYI (MyISAM) format file there. A quick Google look-up revealed that this error is caused by not enough disk space. This was puzzling, as I could see both the /var and / partitions had plenty of space, so this shouldn't have been the problem. Also, manually creating the file /tmp/#sql_9315.MYI with:

server:~# touch /tmp/#sql_9315.MYI

Didn't help, though the file was created fine. Anyway, on a bit of closer examination I noticed a separate /tmp filesystem mounted alongside the other filesystem mounts.
You can guess my amazement at finding this 1-Megabyte-only /tmp filesystem mounted on the server.
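
To spot this situation quickly, checking the mount table and the free space of /tmp is enough; on Debian this emergency mount typically shows up as a tiny tmpfs named "overflow" of about 1 Megabyte (the exact output will differ per system):

server:~# mount | grep ' /tmp '
server:~# df -h /tmp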

I didn't mount this 1-Megabyte filesystem myself, so it was either an intruder or some kind of "weird" bug…
I dug into Google to see if I could find more on the error and found that the whole mess with this 1 MB mounted /tmp partition is actually caused by a recently introduced Debian init script, /etc/init.d/mountoverflowtmp.
It seems this script was introduced in the newer Debian releases. mountoverflowtmp is a kind of emergency script which is triggered in case the root filesystem / space gets filled.
The script has only two options:

# /etc/init.d/mountoverflowtmp
Usage: mountoverflowtmp [start|stop]

Once started, what it does is remount /tmp as a 1-megabyte filesystem and then stop its execution as if it never ran. Well, maybe the developers had something in mind when introducing this script, I will not argue. What I should complain about, though, is that the script's design is completely broken: once the script gets "activated" and does its job, this 1MB mount stays like this even after hard disk space is freed on the root partition /…

Hence, to cope with this unhandy situation once I had freed disk space on the root partition (for some reason the mountoverflowtmp stop option was not working),
I had to initiate a "lazy" unmount:

server:~# umount -l /tmp

Also, as I had a bunch of crashed tables, to fix them I issued a repair on each of the broken tables reported during the /etc/init.d/mysql start start-up:

server:~# mysql -u root -p
mysql> use Database_Name;
mysql> repair table Table_Name extended;
....
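
If the list of broken tables is long, repairing them one by one gets tedious. As an alternative sketch (assuming mostly MyISAM tables and the MySQL root password at hand), mysqlcheck can walk all databases and auto-repair whatever it finds crashed:

server:~# mysqlcheck -u root -p --check --auto-repair --all-databases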

Then, to finally solve the stupid Incorrect key file for table '/tmp/#sql_XXYYZZ33444.MYI'; try to repair it error, I had to restart the SQL server once again:

server:~# /etc/init.d/mysql restart
Stopping MySQL database server: mysqld.
Starting MySQL database server: mysqld.
Checking for corrupt, not cleanly closed and upgrade needing tables..
root@server:/etc/init.d#

Tadadadadam! SQL now loads and works as before!

How to suspend Cpanel user through ssh / Cpanel shell command to suspend users

Friday, August 5th, 2011

One of the servers running Cpanel was suspended today, and the Data Center decided to completely bring down our server, giving us access to it only through rescue mode running a Linux livecd.

Thus I had no way to access the Cpanel web interface to suspend the “hacker”, who by the way was running a number of instances of an old Romanian script-kiddie brute force ssh scanner called sshscan.

Thankfully Cpanel is equipped with a number of handy scripts for emergency situations in the /scripts directory. These shell management scripts are awesome for situations like this one, where no web access is available.

To suspend the abuser (abusive user) I had to issue the command:

root@rescue [/]# /scripts/suspendacct abuse_user
Changing Shell to /bin/false...chsh: Unknown user context is not authorized to change the shell of abuse_user
Done
Locking Password...Locking password for user abuse_user.
passwd: Success
Done
Suspending mysql users
warn [suspendmysqlusers] abuse_user has no databases.
Notification => reports@santrex.net via EMAIL [level => 3]
Account previously suspended (password was locked).
/bin/df: `/proc/sys/fs/binfmt_misc': No such file or directory
Using Universal Quota Support (quota=0)
Suspended document root /home/abuse_user/public_html
Suspended document root /home/abuse_user/public_html/updateverificationonline.com
Using Universal Quota Support (quota=0)
Updating ftp passwords for abuse_user
Ftp password files updated.
Ftp vhost passwords synced
abuse_user's account has been suspended

That’s all, now the user is suspended, so hopefully the DC will bring the server online in a few minutes.
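
For completeness: once the abuse is cleaned up, the account can later be re-enabled with the companion script from the same /scripts directory (shown here only as a sketch, for the same user name):

root@rescue [/]# /scripts/unsuspendacct abuse_user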

How to automatically reboot (restart) Debian GNU Lenny / Squeeze Linux on kernel panic, some general CPU overload or system crash

Monday, June 21st, 2010

If you are a system administrator, you have probably wondered at least once how to configure your Linux server to automatically reboot itself if it crashes or goes through a massive CPU overload, e.g. when the server load average “hits the sky”.
I just learned from a nice article found here that there is a kernel variable which, when enabled, takes care of automatically restarting a server that has crashed with the terrible Kernel Panic message we all know.

The variable I’m talking about is kernel.panic; for instance kernel.panic = 20 would instruct your GNU Linux kernel to automatically reboot 20 seconds after it experiences a kernel panic system crash.

To start using the auto-reboot Linux capabilities on a kernel panic occurrence, just add the variable to /etc/sysctl.conf

debian-server:~# echo 'kernel.panic = 20' >> /etc/sysctl.conf

Now we will also have to load the variable so it starts being used on the system, so execute:

debian-server:~# sysctl -p

There you go, automatic system reboots on kernel panics are now on.
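
To double-check that the setting took effect, you can simply query the value back; a value of 20 means the kernel will wait 20 seconds after a panic and then reboot:

debian-server:~# sysctl kernel.panic
kernel.panic = 20
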
Now, to further assure yourself that the Linux server you’re responsible for will automatically restart itself in an emergency situation like a system overload, I suggest you check out Watchdog.

You might consider checking out this auto reboot tutorial, which explains in simple words how watchdog is installed and configured.
On Debian, installing and maintaining watchdog is really simple and comes down to installing and enabling the watchdog system service, right after you make two changes in its configuration file /etc/watchdog.conf

To do so execute:

debian-server:~# apt-get install watchdog
debian-server:~# echo "file = /var/log/messages" >> /etc/watchdog.conf
debian-server:~# echo "watchdog-device = /dev/watchdog" >> /etc/watchdog.conf

Well, that should be it. You might also need to load a kernel module so the watchdog has a device to talk to.
On my system the kernel modules related to watchdog are located in:

/lib/modules/2.6.26-2-amd64/kernel/drivers/watchdog/

If there is no suitable hardware watchdog module for your machine, you should certainly try the software watchdog Linux kernel module called softdog; to load it, issue:

debian-server:~# /sbin/modprobe softdog
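
You can verify that the module actually got loaded before relying on it:

debian-server:~# lsmod | grep softdog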

It’s best if you load the module while the watchdog daemon is stopped.
If you want the softdog software watchdog kernel driver to be auto-loaded at boot, you should execute:

debian-server:~# echo 'softdog' >> /etc/modules

Finally, a start of the watchdog daemon is necessary:

debian-server:~# /etc/init.d/watchdog start
Stopping watchdog keepalive daemon....
Starting watchdog daemon....

That should be all, your automatic system reboots should now be on! 🙂

How to reboot remotely Linux server if reboot, shutdown and init commands are not working (/sbin/reboot: Input/output error) – Reboot Linux in emergency using MagicSysRQ kernel sysctl variable

Saturday, July 23rd, 2011

SysRQ: an alternative way to restart an unrestartable Linux server

I’ve been in a situation today where one Linux server’s hard drive SCSI driver, or the physical drive itself, started to fail, and in the dmesg kernel log I could see a lot of errors like:

[178071.998440] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[178071.998440] end_request: I/O error, dev sda, sector 89615868

I tried a number of things to remount in read-only mode the hdd which was throwing out errors, but almost all commands I typed on the server were either reported as missing or returned an error:
Input/output error

Just to give you an idea what I mean, here is a paste from the shell:

linux-server:/# vim /etc/fstab
-bash: vim: command not found
linux-server:/# vi /etc/fstab
-bash: vi: command not found
linux-server:/# mcedit /etc/fstab
-bash: /usr/bin/mcedit: Input/output error
linux-server:/# fdisk -l
-bash: /sbin/fdisk: Input/output error

After I had tried all kinds of things to diagnose the server and everything seemed to be failing, I thought a reboot might help, as on server boot the filesystems would get checked with fsck, and fsck might be able to fix (at least temporarily) the mess.

I went on and tried to restart the system, and guess what? I got:

-bash: /sbin/reboot: Input/output error

I hoped that at least the /sbin/shutdown or /sbin/init commands might work, and since I couldn’t use the reboot command I tried these two as well, just to get once again:

linux-server:/# shutdown -r now
bash: /sbin/shutdown: Input/output error
linux-server:/# init 6
bash: /sbin/init: Input/output error

You see, now the situation was not rosy, it seemed there was no way to reboot the system…
Moreover, the server is located in a remote Data Center, and the tech support there conducts assigned tasks with the speed of a turtle.
The server had no remote reboot, web front end or anything of the sort, and therefore I desperately needed a way to restart the machine.

A bit of research on the issue led me to other people who had experienced the /sbin/reboot Input/output error, mostly caused by servers with failing hard drives as well as by HDD controller driver bugs in the Linux kernel.

As I was looking for an alternative way to reboot my Linux machine, in hope this would help, I came across a blog post, Rebooting the Magic Way: http://www.linuxjournal.com/content/rebooting-magic-way

As suggested in Cory’s blog, a nice alternative way to restart a Linux machine without using the reboot, shutdown or init commands is a reboot via the Magic SysRQ key combination.

The only condition for the Magic SysRQ key to work is to have SysRQ support (CONFIG_MAGIC_SYSRQ) enabled at kernel compile time.
Luckily, as of today the SysRQ Magic key is compiled in and enabled by default in almost all modern-day Linux distributions, among them Debian, Fedora and their derivative distributions.

To use the sysrq kernel capabilities as a means to restart the server, it’s necessary first to activate sysrq through sysctl, like so:

linux-server:~# sysctl -w kernel.sysrq=1
kernel.sysrq = 1

I found that enabling kernel.sysrq = 1 permanently is also quite a good idea; to achieve that I used:

echo 'kernel.sysrq = 1' >> /etc/sysctl.conf
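
The sysctl -w invocation above already enables it for the running kernel; the line added to /etc/sysctl.conf only takes effect on the next boot, or immediately if you reload the file:

linux-server:~# sysctl -p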

Next, it’s wise to use the sync command to sync any open files on the server, as well as to stop as many of the server’s actively running services (MySQL, Apache etc.) as possible.

linux-server:~# sync

Now to reboot the Linux server, I used the /proc Linux virtual filesystem by issuing:

linux-server:~# echo b > /proc/sysrq-trigger

Using echo b > /proc/sysrq-trigger simulates the Magic SysRQ keyboard key press and hence instructs the kernel to immediately reboot the system.
However one should be careful with using the sysrq-trigger because it’s not a complete substitute for /sbin/reboot or /sbin/shutdown -r commands.
One major difference is that the standard /sbin/reboot kills all the running processes on the Linux machine and attempts to unmount all filesystems before it proceeds to send the kernel the reboot instruction.

Using echo b > /proc/sysrq-trigger, however, neither tries to unmount the mounted filesystems nor tries to kill all processes and sync the filesystems, so on a heavily loaded (SQL data critical) server its use might create enormous problems and lead to severe data loss!
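
If the disk still responds well enough to accept writes to /proc, a somewhat gentler variant (borrowing from the classic Magic SysRQ R-E-I-S-U-B order) is to first ask the kernel to sync and remount the filesystems read-only before triggering the reboot. This is only a sketch and still no substitute for a clean shutdown:

linux-server:~# echo s > /proc/sysrq-trigger   # emergency sync of all mounted filesystems
linux-server:~# sleep 5
linux-server:~# echo u > /proc/sysrq-trigger   # remount all filesystems read-only
linux-server:~# sleep 5
linux-server:~# echo b > /proc/sysrq-trigger   # immediate reboot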

SO BEWARE: be sure you know what you’re doing before you proceed to using /proc/sysrq-trigger as a way to reboot ;).