Posts Tagged ‘fix’

Must have software on freshly installed windows – Essential Software after fresh Windows install

Friday, March 18th, 2016

Install-update-multiple-programs-applications-at-once-using-ninite

If you're into IT industry even if you don't like installing frequently Windows or you're completely Linux / BSD user, you will certainly have a lot of friends which will want help from you to re-install or fix their Windows 7 / 8 / 10 OS. At least this is the case with me every year, I'm kinda of obliged to install fresh windowses on new bought friends or relatives notebooks / desktop PCs.

Of course according to for whom the new Windows OS installed the preferrences of necessery software varies, however more or less there is sort of standard list of Windows Software which is used daily by most of Avarage Computer user, such as:
 

Not to forget a good candidate from the list to install on new fresh windows Installation candidates are:

  • Winrar
  • PeaZIP
  • WinZip
  • GreenShot (to be able to easily screenshot stuff and save pictures locally and to the cloud)
  • AnyDesk (non free but very functional alternative to TeamViewer) to be able to remotely access remote PC
  • TightVNC
  • ITunes / Spotify (for people who have also iPhone smart phone)
  • DropBox or pCloud (to have some extra cloud free space)
  • FBReader (for those reading a lot of books in different formats)
  • Rufus – Rufus is an efficient and lightweight tool to create bootable USB drives. It helps you to create BIOS or UEFI bootable devices. It helps you to create Windows TO Go drives. It provides support for various disk, format, and partition.
  • Recuva is a data recovery software for Windows 10 (non free)
  • EaseUS (for specific backup / restore data purposes but unfortunately (non free)
  • For designers
  • Adobe Photoshop
  • Adobe Illustrator
  • f.lux –  to control brightness of screen and potentially Save your eyes
  • ImDisk virtual Disk Driver
  • KeePass / PasswordSafe – to Securely store your passwords
  • Putty / MobaXterm / SecureCRT / mPutty (for system administrators and programmers that has to deal with Linux / UNIX)

I tend to install on New Windows installs and thus I have more or less systematized the process.

I try to usually stick to free software where possible for each of the above categories as a Free Software enthusiast and luckily nowadays there is a lot of non-priprietary or at least free as in beer software available out there.

For Windows sysadmins or College and other public institutions networks including multiple of Windows Computers which are not inside a domain and also for people in computer repair shops where daily dozens of windows pre-installs or a set of software Automatic updates are  necessery make sure to take a look at Ninite

ninite-automate-windows-program-deploy-and-update-on-new-windows-os-openoffice-screenshot

As official website introduces Ninite:

Ninite – Install and Update All Your Programs at Once

Of course as Ninite is used by organizations as NASA, Harvard Medical School etc. it is likely the tool might reports your installed list of Windows software and various other Win PC statistical data to Ninite developers and most likely NSA, but this probably doesn't much matter as this is probably by the moment you choose to have installed a Windows OS on your PC.

ninite-choises-to-build-an-install-package-with-useful-essential-windows-software-screenshot
 

For Windows System Administrators managing small and middle sized network PCs that are not inside a Domain Controller, Ninite could definitely save hours and at cases even days of boring install and maintainance work. HP Enterprise or HP Inc. Employees or ex-employees would definitely love Ninite, because what Ninite does is pretty much like the well known HP Internal Tool PC COE.

Ninite could also prepare an installer containing multiple applications based on the choice on Ninite's website, so that's also a great thing especially if you need to deploy a different type of Users PCs (Scientific / Gamers / Working etc.)

Perhaps there are also other useful things to install on a new fresh Windows installations, if you're using something I'm missing let me know in comments.

Fix ruby: /usr/lib/libcrypt.so.1: version `XCRYPT_2.0′ not found in apt upgrade on Debian Linux 10

Saturday, August 5th, 2023

I've an old legacy Thinkpad Laptop that is for simplicty running Window Maker Wmaker which was laying on my home desk for almost an year and I remembered since i'm for few days in my parents home in Dobrich that it will be a good idea to update its software to the latest Debian packages to patch security issues with it. Thus if you're like me and  you tried to update your Debian 10 Linux to the latest Stable release debian packages  and you end up into a critical error that is preventing apt to to resolve conflicts (fix it with) cmds like:

# apt-get update –fix-missing

# apt –fix-broken install

As usual I looked into Google to see about solution and found few articles, claiming to have scripts that fix it but at the end nothing worked.
And the shitty error occured during the standard:

# apt-get update && apt-get upgrade

ruby: /usr/lib/libcrypt.so.1: version `XCRYPT_2.0' not found

Hence the cause and work around seemed to be very unexpected.
For some reason debian makes a link

root@noah:/lib# ls -al /lib/libcrypt.so.1
lrwxrwxrwx 1 root root 19 Aug  3 16:53 /lib/libcrypt.so.1 ->
libcrypt.so.1.bak

root@noah:/lib# ls -al /lib/libcrypt.so.1.bak
lrwxrwxrwx 1 root root 16 Jun 15  2017 /lib/libcrypt.so.1.bak -> libcrypt-2.24.so

Thus to resolve it and force the .deb upgrade package to continue it is up to simply deleting the strange simlink and re-run the

# apt-get update && apt-get upgrade

Setting up libc6:i386 (2.31-13+deb11u6) …
/usr/bin/perl: /lib/libcrypt.so.1: version `XCRYPT_2.0' not found (required by /usr/bin/perl)
dpkg: error processing package libc6:i386 (–configure):
 installed libc6:i386 package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 libc6:i386

Few more times. If you get some critical apt failures still, each time make sure to rerun the command after doing a simple removal of the strange simbolic link with cmd:

# rm -f /lib/libcrypt.so.1

That's all folks after a short while your Debian will be updated to latest Enoy folks ! 🙂

List and fix failed systemd failed services after Linux OS upgrade and how to get full info about systemd service from jorunal log

Friday, February 25th, 2022

systemd-logo-unix-linux-list-failed-systemd-services

I have recently upgraded a number of machines from Debian 10 Buster to Debian 11 Bullseye. The update as always has some issues on some machines, such as problem with package dependencies, changing a number of external package repositories etc. to match che Bullseye deb packages. On some machines the update was less painful on others but the overall line was that most of the machines after the update ended up with one or more failed systemd services. It could be that some of the machines has already had this failed services present and I never checked them from the previous time update from Debian 9 -> Debian 10 or just some mess I've left behind in the hurry when doing software installation in the past. This doesn't matter anyways the fact was that I had to deal to a number of systemctl services which I managed to track by the Failed service mesage on system boot on one of the physical machines and on the OpenXen VTY Console the rest of Virtual Machines after update had some Failed messages. Thus I've spend some good amount of time like an overall of a day or two fixing strange failed services. This is how this small article was born in attempt to help sysadmins or any home Linux desktop users, who has updated his Debian Linux / Ubuntu or any other deb based distribution but due to the chaotic nature of Linux has ended with same strange Failed services and look for a way to find the source of the failures and get rid of the problems. 
Systemd is a very complicated system and in my many sysadmin opinion it makes more problems than it solves, but okay for today's people's megalomania mindset it matches well.

Systemd_components-systemd-journalctl-cgroups-loginctl-nspawn-analyze.svg

 

1. Check the journal for errors, running service irregularities and so on
 

First thing to do to track for errors, right after the update is to take some minutes and closely check,, the journalctl for any strange errors, even on well maintained Unix machines, this journal log would bring you to a problem that is not fatal but still some process or stuff is malfunctioning in the background that you would like to solve:
 

root@pcfreak:~# journalctl -x
Jan 10 10:10:01 pcfreak CRON[17887]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17887]: USER_END pid=17887 uid=0 auid=0 ses=340858 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17888]: CRED_DISP pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17888]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17888]: USER_END pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17884]: CRED_DISP pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17884]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17884]: USER_END pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17886]: CRED_DISP pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/c>
Jan 10 10:10:01 pcfreak CRON[17886]: pam_unix(cron:session): session closed for user www-data
Jan 10 10:10:01 pcfreak audit[17886]: USER_END pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permi>
Jan 10 10:10:08 pcfreak NetworkManager[696]:  [1641802208.0899] device (eth1): carrier: link connected
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:19 pcfreak NetworkManager[696]:
 [1641802219.7920] device (eth1): carrier: link connected
Jan 10 10:10:19 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:20 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:22 pcfreak NetworkManager[696]:
 [1641802222.2772] device (eth1): carrier: link connected
Jan 10 10:10:22 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:23 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:33 pcfreak sshd[18142]: Unable to negotiate with 66.212.17.162 port 19255: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diff>
Jan 10 10:10:41 pcfreak NetworkManager[696]:
 [1641802241.0186] device (eth1): carrier: link connected
Jan 10 10:10:41 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx

If you want to only check latest journal log messages use the -x -e (pager catalog) opts

root@pcfreak;~# journalctl -xe

Feb 25 13:08:29 pcfreak audit[2284920]: USER_LOGIN pid=2284920 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=login acct=28696E76616C>
Feb 25 13:08:29 pcfreak sshd[2284920]: Received disconnect from 177.87.57.145 port 40927:11: Bye Bye [preauth]
Feb 25 13:08:29 pcfreak sshd[2284920]: Disconnected from invalid user ubuntuuser 177.87.57.145 port 40927 [preauth]

Next thing to after the update was to get a list of failed service only.


2. List all systemd failed check services which was supposed to be running

root@pcfreak:/root # systemctl list-units | grep -i failed
● certbot.service                                                                                                       loaded failed failed    Certbot
● logrotate.service                                                                                                     loaded failed failed    Rotate log files
● maldet.service                                                                                                        loaded failed failed    LSB: Start/stop maldet in monitor mode
● named.service                                                                                                         loaded failed failed    BIND Domain Name Server


Alternative way is with the –failed option

hipo@jeremiah:~$ systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

 

root@jeremiah:/etc/apt/sources.list.d#  systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon


To get a full list of objects of systemctl you can pass as state:
 

# systemctl –state=help
Full list of possible load states to pass is here
Show service properties


Check whether a service is failed or has other status and check default set systemd variables for it.

root@jeremiah~:# systemctl is-failed vboxweb.service
inactive

# systemctl show haproxy
Type=notify
Restart=always
NotifyAccess=main
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=304858
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success

Full output of the above command is dumped in show_systemctl_properties.txt


3. List all running systemd services for a better overview on what's going on on machine
 

To get a list of all properly systemd loaded services you can use –state running.

hipo@jeremiah:~$ systemctl list-units –state running|head -n 10
  UNIT                              LOAD   ACTIVE SUB     DESCRIPTION
  proc-sys-fs-binfmt_misc.automount loaded active running Arbitrary Executable File Formats File System Automount Point
  cups.path                         loaded active running CUPS Scheduler
  init.scope                        loaded active running System and Service Manager
  session-2.scope                   loaded active running Session 2 of user hipo
  accounts-daemon.service           loaded active running Accounts Service
  anydesk.service                   loaded active running AnyDesk
  apache-htcacheclean.service       loaded active running Disk Cache Cleaning Daemon for Apache HTTP Server
  apache2.service                   loaded active running The Apache HTTP Server
  avahi-daemon.service              loaded active running Avahi mDNS/DNS-SD Stack

 

It is useful thing is to list all unit-files configured in systemd and their state, you can do it with:

 


root@pcfreak:~# systemctl list-unit-files
UNIT FILE                                                                 STATE           VENDOR PRESET
proc-sys-fs-binfmt_misc.automount                                         static          –            
-.mount                                                                   generated       –            
backups.mount                                                             generated       –            
dev-hugepages.mount                                                       static          –            
dev-mqueue.mount                                                          static          –            
media-cdrom0.mount                                                        generated       –            
mnt-sda1.mount                                                            generated       –            
proc-fs-nfsd.mount                                                        static          –            
proc-sys-fs-binfmt_misc.mount                                             disabled        disabled     
run-rpc_pipefs.mount                                                      static          –            
sys-fs-fuse-connections.mount                                             static          –            
sys-kernel-config.mount                                                   static          –            
sys-kernel-debug.mount                                                    static          –            
sys-kernel-tracing.mount                                                  static          –            
var-www.mount                                                             generated       –            
acpid.path                                                                masked          enabled      
cups.path                                                                 enabled         enabled      

 

 


root@pcfreak:~# systemctl list-units –type service –all
  UNIT                                   LOAD      ACTIVE   SUB     DESCRIPTION
  accounts-daemon.service                loaded    inactive dead    Accounts Service
  acct.service                           loaded    active   exited  Kernel process accounting
● alsa-restore.service                   not-found inactive dead    alsa-restore.service
● alsa-state.service                     not-found inactive dead    alsa-state.service
  apache2.service                        loaded    active   running The Apache HTTP Server
● apparmor.service                       not-found inactive dead    apparmor.service
  apt-daily-upgrade.service              loaded    inactive dead    Daily apt upgrade and clean activities
 apt-daily.service                      loaded    inactive dead    Daily apt download activities
  atd.service                            loaded    active   running Deferred execution scheduler
  auditd.service                         loaded    active   running Security Auditing Service
  auth-rpcgss-module.service             loaded    inactive dead    Kernel Module supporting RPCSEC_GSS
  avahi-daemon.service                   loaded    active   running Avahi mDNS/DNS-SD Stack
  certbot.service                        loaded    inactive dead    Certbot
  clamav-daemon.service                  loaded    active   running Clam AntiVirus userspace daemon
  clamav-freshclam.service               loaded    active   running ClamAV virus database updater
..

 


linux-systemd-components-diagram-linux-kernel-system-targets-systemd-libraries-daemons

 

4. Finding out more on why a systemd configured service has failed


Usually getting info about failed systemd service is done with systemctl status servicename.service
However, in case of troubles with service unable to start to get more info about why a service has failed with (-l) or (–full) options


root@pcfreak:~# systemctl -l status logrotate.service
● logrotate.service – Rotate log files
     Loaded: loaded (/lib/systemd/system/logrotate.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 00:00:06 EET; 13h ago
TriggeredBy: ● logrotate.timer
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
    Process: 2045320 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)
   Main PID: 2045320 (code=exited, status=1/FAILURE)
        CPU: 2.479s

Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: For now we will assume you meant to write /32
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| ERROR: '0.0.0.0/0.0.0.0' needs to be replaced by the term 'all'.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| SECURITY NOTICE: Overriding config setting. Using 'all' instead.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: (B) '::/0' is a subnetwork of (A) '::/0'
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: because of this '::/0' is ignored to keep splay tree searching predictable
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: You should probably remove '::/0' from the ACL named 'all'
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Failed with result 'exit-code'.
Feb 25 00:00:06 pcfreak systemd[1]: Failed to start Rotate log files.
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Consumed 2.479s CPU time.


systemctl -l however is providing only the last log from message a started / stopped or whatever status service has generated. Sometimes systemctl -l servicename.service is showing incomplete the splitted error message as there is a limitation of line numbers on the console, see below

 

root@pcfreak:~# systemctl status -l certbot.service
● certbot.service – Certbot
     Loaded: loaded (/lib/systemd/system/certbot.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 09:28:33 EET; 4h 0min ago
TriggeredBy: ● certbot.timer
       Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
             https://certbot.eff.org/docs
    Process: 290017 ExecStart=/usr/bin/certbot -q renew (code=exited, status=1/FAILURE)
   Main PID: 290017 (code=exited, status=1/FAILURE)
        CPU: 9.771s

Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using th>
Feb 25 09:28:33 pcfrxen certbot[290017]: All renewals failed. The following certificates could not be renewed:
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/mail.pcfreak.org-0003/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/www.eforia.bg-0005/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/zabbix.pc-freak.net/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]: 3 renew failure(s), 5 parse failure(s)
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Failed with result 'exit-code'.
Feb 25 09:28:33 pcfrxen systemd[1]: Failed to start Certbot.
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Consumed 9.771s CPU time.

 

5. Get a complete log of journal to make sure everything configured on server host runs as it should

Thus to get more complete list of the message and be able to later google and look if has come with a solution on the internet  use:

root@pcfrxen:~#  journalctl –catalog –unit=certbot

— Journal begins at Sat 2022-01-22 21:14:05 EET, ends at Fri 2022-02-25 13:32:01 EET. —
Jan 23 09:58:18 pcfrxen systemd[1]: Starting Certbot…
░░ Subject: A start job for unit certbot.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit certbot.service has begun execution.
░░ 
░░ The job identifier is 5754.
Jan 23 09:58:20 pcfrxen certbot[124996]: Traceback (most recent call last):
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/renewal.py", line 71, in _reconstitute
Jan 23 09:58:20 pcfrxen certbot[124996]:     renewal_candidate = storage.RenewableCert(full_path, config)
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 471, in __init__
Jan 23 09:58:20 pcfrxen certbot[124996]:     self._check_symlinks()
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 537, in _check_symlinks

root@server:~# journalctl –catalog –unit=certbot|grep -i pluginerror|tail -1
Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using the manual plugin non-interactively.')


Or if you want to list and read only the last messages in the journal log regarding a service

root@server:~# journalctl –catalog –pager-end –unit=certbot


If you have disabled a failed service because you don't need it to run at all on the machine with:

root@rhel:~# systemctl stop rngd.service
root@rhel:~# systemctl disable rngd.service

And you want to clear up any failed service information that is kept in the systemctl service log you can do it with:
 

root@rhel:~# systemctl reset-failed

Another useful systemctl option is cat, you can use it to easily list a service it is useful to quickly check what is a service, an actual shortcut to save you from giving a full path to the service e.g. cat /lib/systemd/system/certbot.service

root@server:~# systemctl cat certbot
# /lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://certbot.eff.org/docs
[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true


After failed SystemD services are fixed, it is best to reboot the machine and check put some more time to inspect rawly the complete journal log to make sure, no error  was left behind.


Closure
 

As you can see updating a machine from a major to a major version even if you follow the official documentation and you have plenty of experience is always more or a less a pain in the ass, which can eat up much of your time banging your head solving problems with failed daemons issues with /etc/rc.local (which I have faced becase of #/bin/sh -e (which would make /etc/rc.local) to immediately quit if any error from command $? returns different from 0 etc.. The  logical questions comes then;
1. Is it really worthy to update at all regularly, especially if you don't know of a famous major Vulnerability 🙂 ?
2. Or is it worthy to update from OS major release to OS major release at all?  
3. Or should you only try to patch the service that is exposed to an external reachable computer network or the internet only and still the the same OS release until End of Life (LTS = Long Term Support) as called in Debian or  End Of Life  (EOL) Cycle as called in RPM based distros the period until the OS major release your software distro has official security patches is reached.

Anyone could take any approach but for my own managed systems small network at home my practice was always to try to keep up2date everything every 3 or 6 months maximum. This has caused me multiple days of irritation and stress and perhaps many white hairs and spend nerves on shit.


4. Based on the company where I'm employed the better strategy is to patch to the EOL is still offered and keep the rule First Things First (FTF), once the EOL is reached, just make a copy of all servers data and configuration to external Data storage, bring up a new Physical or VM and migrate the services.
Test after the migration all works as expected if all is as it should be change the DNS records or Leading Infrastructure Proxies whatever to point to the new service and that's it! Yes it is true that migration based on a full OS reinstall is more time consuming and requires much more planning, but usually the result is much more expected, plus it is much less stressful for the guy doing the job.

Debug and fix Virtuozzo / KVM broken Hypervisor error: ‘PrlSDKError(‘SDK error: 0x80000249: Unable to connect to Virtuozzo. You may experience a connection problem or the server may be down.’ on CentOS Linux howto

Thursday, January 28th, 2021

fix-sdkerror-virtuozzo-kvm-how-to-debug-problems-with-hypervisor-host-linux

I've recently yum upgraded a CentOS Linux server runinng Virtuozzo kernel and Virtuozzo virtualization Virtual Machines to the latest available CentOS Linux release 7.9.2009 (Core) just to find out after the upgrade there was issues with both virtuozzo (VZ) way to list installed VZ enabled VMs reporting Unable to connect to Virtuozzo error like below:
 

[root@CENTOS etc]# prlctl list -a
Unable to connect to Virtuozzo. You may experience a connection problem or the server may be down. Contact your Virtuozzo administrator for assistance.


Even the native QEMU KVM VMs installed on the Hypervisor system failed to work to list and bring up the VMs producing another unexplainable error with virsh unable to connect to the hypervisor socket

[root@CENTOS etc]# virsh list –all
error: failed to connect to the hypervisor
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory


In dmesg cmd kernel log messages the error found looked as so:

[root@CENTOS etc]# dmesg|grep -i sdk


[    5.314601] PrlSDKError('SDK error: 0x80000249: Unable to connect to Virtuozzo. You may experience a connection problem or the server may be down. Contact your Virtuozzo administrator for assistance.',)

To fix it I had to experiment a bit based on some suggestions from Google results as usual and what turned to be the cause is a now obsolete setting for disk probing that is breaking libvirtd

Disable allow_disk_format_probing in /etc/libvirt/qemu.conf

The fix to PrlSDKError('SDK error: 0x80000249: Unable to connect to Virtuozzo comes to commenting a parameter inside 

/etc/libvirt/qemu.conf

which for historical reasons seems to be turned on by default it is like this

allow_disk_format_probing = 1


Resolution is to either change the value to 0 or completely comment the line:

[root@CENTOS etc]# grep allow_disk_format_probing /etc/libvirt/qemu.conf
# If allow_disk_format_probing is enabled, libvirt will probe disk
#allow_disk_format_probing = 1
#allow_disk_format_probing = 1


Debug problems with Virtuozzo services and validate virtualization setup

What really helped to debug the issue was to check the extended status info of libvirtd.service vzevent vz.service libvirtguestd.service prl-disp systemd services

[root@CENTOS etc]# systemctl -l status libvirtd.service vzevent vz.service libvirtguestd.service prl-disp

Here I had to analyze the errors and googled a little bit about it


Once this is changed I had to of course restart libvirtd.service and rest of virtuozzo / kvm services

[root@CENTOS etc]# systemctl restart libvirtd.service ibvirtd.service vzevent vz.service libvirtguest.service prl-disp


Another useful tool part of a standard VZ install that I've used to make sure each of the Host OS Hypervisor components is running smoothly is virt-host-validate (tool is part of libvirt-client rpm package)

[root@CENTOS etc]# virt-host-validate
  QEMU: Checking for hardware virtualization                                 : PASS
  QEMU: Checking if device /dev/kvm exists                                   : PASS
  QEMU: Checking if device /dev/kvm is accessible                            : PASS
  QEMU: Checking if device /dev/vhost-net exists                             : PASS
  QEMU: Checking if device /dev/net/tun exists                               : PASS
  QEMU: Checking for cgroup 'memory' controller support                      : PASS
  QEMU: Checking for cgroup 'memory' controller mount-point                  : PASS
  QEMU: Checking for cgroup 'cpu' controller support                         : PASS
  QEMU: Checking for cgroup 'cpu' controller mount-point                     : PASS
  QEMU: Checking for cgroup 'cpuacct' controller support                     : PASS
  QEMU: Checking for cgroup 'cpuacct' controller mount-point                 : PASS
  QEMU: Checking for cgroup 'cpuset' controller support                      : PASS
  QEMU: Checking for cgroup 'cpuset' controller mount-point                  : PASS
  QEMU: Checking for cgroup 'devices' controller support                     : PASS
  QEMU: Checking for cgroup 'devices' controller mount-point                 : PASS
  QEMU: Checking for cgroup 'blkio' controller support                       : PASS
  QEMU: Checking for cgroup 'blkio' controller mount-point                   : PASS
  QEMU: Checking for device assignment IOMMU support                         : PASS
  QEMU: Checking if IOMMU is enabled by kernel                               : WARN (IOMMU appears to be disabled in kernel. Add intel_iommu=on to kernel cmdline arguments)
   LXC: Checking for Linux >= 2.6.26                                         : PASS
   LXC: Checking for namespace ipc                                           : PASS
   LXC: Checking for namespace mnt                                           : PASS
   LXC: Checking for namespace pid                                           : PASS
   LXC: Checking for namespace uts                                           : PASS
   LXC: Checking for namespace net                                           : PASS
   LXC: Checking for namespace user                                          : PASS
   LXC: Checking for cgroup 'memory' controller support                      : PASS
   LXC: Checking for cgroup 'memory' controller mount-point                  : PASS
   LXC: Checking for cgroup 'cpu' controller support                         : PASS
   LXC: Checking for cgroup 'cpu' controller mount-point                     : PASS
   LXC: Checking for cgroup 'cpuacct' controller support                     : PASS
   LXC: Checking for cgroup 'cpuacct' controller mount-point                 : PASS
   LXC: Checking for cgroup 'cpuset' controller support                      : PASS
   LXC: Checking for cgroup 'cpuset' controller mount-point                  : PASS
   LXC: Checking for cgroup 'devices' controller support                     : PASS
   LXC: Checking for cgroup 'devices' controller mount-point                 : PASS
   LXC: Checking for cgroup 'blkio' controller support                       : PASS
   LXC: Checking for cgroup 'blkio' controller mount-point                   : PASS
   LXC: Checking if device /sys/fs/fuse/connections exists                   : PASS


One thing to note here that virt-host-validate helped me to realize the  fuse (File system in userspace) module kernel support enabled on the HV was missing so I've enabled temporary for this boot with modprobe and permanently via a configuration like so:

# to load it one time
[root@CENTOS etc]#  modprobe fuse
# to load fuse permnanently on next boot

[root@CENTOS etc]#  echo fuse >> /etc/modules-load.d/fuse.conf

Disable selinux on CentOS HV

Another thing was selinux was enabled on the HV. Selinux is really annoying thing and to be honest I never used it on any server and though its idea is quite good the consequences it creates for daily sysadmin work are terrible so I usually disable it. It could be that a Hypervisor Host OS might work just normal with the selinux enabled but just in case I decided to remove it. This is how

[root@CENTOS etc]#  sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      31

To temporarily change the SELinux mode from targeted to permissive with the following command:

[root@CENTOS etc]#  setenforce 0

Edit /etc/selinux/config file and set the SELINUX mod to disabled

[root@CENTOS etc]# vim /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#       enforcing – SELinux security policy is enforced.
#       permissive – SELinux prints warnings instead of enforcing.
#       disabled – No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#       targeted – Targeted processes are protected,
#       mls – Multi Level Security protection.
SELINUXTYPE=targeted

Finally rebooted graceously the machine just in case with the good recommended way to reboot servers with shutdown command instead of /sbin/reboot

[root@CENTOS etc]# shutdown -r now

The advantage of shutdown is that it tries to shutdown each service by sending stop requests but usually this takes some time and even a shutdown request could take longer to proccess as each service such as a WebServer application is being waited to close all its network connections etc. |
However if you want to have a quick reboot and you don't care about any established network connections to third party IPs you can go for the brutal old fashioned /sbin/reboot 🙂

How to fix rkhunter checking dev for suspisiocus files, solve rkhunter checking if SSH root access is allowed warning

Friday, November 20th, 2020

rkhunter-logo

On a server if you have a rkhunter running and you suddenly you get some weird Warnings for suspicious files under dev, like show in in the screenshot and you're puzzled how comes this happened as so far it was not reported before the regular package patching update conducted …

root@haproxy-server ~]# rkhunter –check

rkhunter-warn-screenshot

To investigate further I've checked rkhunter produced log /var/log/rkhunter.log for a verobose message and found more specifics there on what is the exact files which rkhunter finds suspicious.
To further investigate what exactly are this suspicious files for or where, they're used for something on the system or in reality it is a hacker who hacked our supposibly PCI compliant system,
I've used the good old fuser command which is capable to show which system process is actively using a file. To have fuser report for each file from /var/log/rkhunter.log with below shell loop:

[root@haproxy-server ~]#  for i in $(tail -n 50 /var/log/rkhunter/rkhunter.log|grep -i /dev/shm|awk '{ print $2 }'|sed -e 's#:##g'); do fuser -v $i; done
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1851-27-f1sTlC/qb-request-cpg-header:
                     root       1783 ….m corosync
                     hacluster   1851 ….m attrd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-26-Znk1UM/qb-event-quorum-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-26-Znk1UM/qb-event-quorum-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-26-Znk1UM/qb-response-quorum-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-26-Znk1UM/qb-response-quorum-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-26-Znk1UM/qb-request-quorum-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-26-Znk1UM/qb-request-quorum-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-25-oCdaKX/qb-event-cpg-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-25-oCdaKX/qb-event-cpg-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-25-oCdaKX/qb-response-cpg-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-25-oCdaKX/qb-response-cpg-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-25-oCdaKX/qb-request-cpg-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-25-oCdaKX/qb-request-cpg-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-24-GKyj3l/qb-event-cfg-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-24-GKyj3l/qb-event-cfg-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-24-GKyj3l/qb-response-cfg-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-24-GKyj3l/qb-response-cfg-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-24-GKyj3l/qb-request-cfg-data:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd
                     BEN.        PID ZUGR.  BEFEHL
/dev/shm/qb-1783-1844-24-GKyj3l/qb-request-cfg-header:
                     root       1783 ….m corosync
                     root       1844 ….m pacemakerd


As you see from the output all the /dev/shm/qb/ files in question are currently opened by the corosync / pacemaker and necessery for proper work of the haproxy cluster processes running on the machines.
 

How to solve the /dev/ suspcisios files rkhunter warning?

To solve we need to tell rkhunter not check against this files this is done via  /etc/rkhunter.conf first I thought this is done by EXISTWHITELIST= but then it seems there is  a special option for rkhunter whitelisting /dev type of files only ALLOWDEVFILE.

Hence to resolve the warning for the upcoming planned early PCI audit and save us troubles we had to add on running OS which is CentOS Linux release 7.8.2003 (Core) in /etc/rkhunter.conf

ALLOWDEVFILE=/dev/shm/qb-*/qb-*

Re-run

# rkhunter –check

and Voila, the warning should be no more.

rkhunter-check-output

Another thing is on another machine the warnings produced by rkhunter were a bit different as rkhunter has mistakenly detected the root login is enabled where in reality PermitRootLogin was set to no in /etc/ssh/sshd_config

rkhunter-warning

As the problem was experienced on some machines and on others it was not.
I've done the standard boringconfig comparison we sysadmins do to tell
why stuff differs.
The result was on first machine where we had everything working as expected and
PermitRootLogin no was recognized the correct configuration was:

— SNAP —
#ALLOW_SSH_ROOT_USER=no
ALLOW_SSH_ROOT_USER=unset
— END —

On the second server where the problem was experienced the values was:

— SNAP —
#ALLOW_SSH_ROOT_USER=unset
ALLOW_SSH_ROOT_USER=no
— END —

Note that, the warning produced regarding the rsyslog remote logging is allowed is perfectly fine as, we had enabled remote logging to a central log server on the machines, this is done with:

This is done with config options under /etc/rsyslog.conf

# Configure Remote rsyslog logging server
*.* @remote-logging-server.com:514
*.* @remote-logging-server.com:514

Fix FTP active connection issues “Cannot create a data connection: No route to host” on ProFTPD Linux dedicated server

Tuesday, October 1st, 2019

proftpd-linux-logo

Earlier I've blogged about an encounter problem that prevented Active mode FTP connections on CentOS
As I'm working for a client building a brand new dedicated server purchased from Contabo Dedi Host provider on a freshly installed Debian 10 GNU / Linux, I've had to configure a new FTP server, since some time I prefer to use Proftpd instead of VSFTPD because in my opinion it is more lightweight and hence better choice for a small UNIX server setups. During this once again I've encounted the same ACTIVE FTP not working from FTP server to FTP client host machine. But before shortly explaining, the fix I find worthy to explain briefly what is ACTIVE / PASSIVE FTP connection.

 

1. What is ACTIVE / PASSIVE FTP connection?
 

Whether in active mode, the client specifies which client-side port the data channel has been opened and the server starts the connection. Or in other words the default FTP client communication for historical reasons is in ACTIVE MODE. E.g.
Client once connected to Server tells the server to open extra port or ports locally via which the overall FTP data transfer will be occuring. In the early days of networking when FTP protocol was developed security was not of such a big concern and usually Networks did not have firewalls at all and the FTP DATA transfer host machine was running just a single FTP-server and nothing more in this, early days when FTP was not even used over the Internet and FTP DATA transfers happened on local networks, this was not a problem at all.

In passive mode, the server decides which server-side port the client should connect to. Then the client starts the connection to the specified port.

But with the ever increasing complexity of Internet / Networks and the ever tightening firewalls due to viruses and worms that are trying to own and exploit networks creating unnecessery bulk loads this has changed …

active-passive-ftp-explained-diagram
 

2. Installing and configure ProFTPD server Public ServerName

I've installed the server with the common cmd:

 

apt –yes install proftpd

 

And the only configuration changed in default configuration file /etc/proftpd/proftpd.conf  was
ServerName          "Debian"

I do this in new FTP setups for the logical reason to prevent the multiple FTP Vulnerability Scan script kiddie Crawlers to know the exact OS version of the server, so this was changed to:

 

ServerName "MyServerHostname"

 

Though this is the bad security through obscurity practice doing so is a good practice.
 

3. Create iptable firewall rules to allow ACTIVE FTP mode


But anyways, next step was to configure the firewall to be allowed to communicate on TCP PORT 21 and 20 to incoming source ports range 1024:65535 (to enable ACTIVE FTP) on firewal level with iptables on INPUT and OUTPUT chain rules, like this:

 

iptables -A INPUT -p tcp –sport 1024:65535 -d 0/0 –dport 21 -m state –state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp -s 0/0 –sport 1024:65535 -d 0/0 –dport 20 -m state –state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p tcp -s 0/0 –sport 21 -d 0/0 –dport 1024:65535 -m state –state ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p tcp -s 0/0 –sport 20 -d 0/0 –dport 1024:65535 -m state –state ESTABLISHED,RELATED -j ACCEPT


Talking about Active and Passive FTP connections perhaps for novice Linux users it might be worthy to say few words on Active and Passive FTP connections

Once firewall has enabled FTP Active / Passive connections is on and FTP server is listening, to test all is properly configured check iptable rules and FTP listener:
 

/sbin/iptables -L INPUT |grep ftp
ACCEPT     tcp  —  anywhere             anywhere             tcp spts:1024:65535 dpt:ftp state NEW,ESTABLISHED
ACCEPT     tcp  —  anywhere             anywhere             tcp spts:1024:65535 dpt:ftp-data state NEW,ESTABLISHED
ACCEPT     tcp  —  anywhere             anywhere             tcp dpt:ftp
ACCEPT     tcp  —  anywhere             anywhere             tcp dpt:ftp-data

netstat -l | grep "ftp"
tcp6       0      0 [::]:ftp                [::]:*                  LISTEN    

 

4. Loading nf_nat_ftp module and net.netfilter.nf_conntrack_helper (for backward compitability)


Next step of course was to add the necessery modules nf_nat_ftp nf_conntrack_sane that makes FTP to properly forward ports with respective Firewall states on any of above source ports which are usually allowed by firewalls, note that the range of ports given 1024:65535 might be too much liberal for paranoid sysadmins and in many cases if ports are not filtered, if you are a security freak you can use some smaller range such as 60000-65535.

 

Here is time to say for sysadmins who haven't recently had a task to configure a new (unecrypted) File Transfer Server as today Secure FTP is almost alltime used for file transfers for the sake of security might be puzzled to find out the old Linux kernel ip_conntrack_ftp which was the standard module used to make FTP Active connections work is substituted nowadays with  nf_nat_ftp and nf_conntrack_sane.

To make the 2 modules permanently loaded on next boot on Debian Linux they have to be added to /etc/modules

Here is how sample /etc/modules that loads the modules on next system boot looks like

cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
softdog
nf_nat_ftp
nf_conntrack_sane


Next to say is that in newer Linux kernels 3.x / 4.x / 5.x the nf_nat_ftp and nf_conntrack-sane behaviour changed so  simply loading the modules would not work and if you do the stupidity to test it with some FTP client (I used gFTP / ncftp from my Linux desktop ) you are about to get FTP No route to host errors like:

 

Cannot create a data connection: No route to host

 

cannot-create-a-data-connection-no-route-to-host-linux-error-howto-fix


Sometimes, instead of No route to host error the error FTP client might return is:

 

227 entering passive mode FTP connect connection timed out error


To make the nf_nat_ftp module on newer Linux kernels hence you have to enable backwards compatibility Kernel variable

 

 

/proc/sys/net/netfilter/nf_conntrack_helper

 

echo 1 > /proc/sys/net/netfilter/nf_conntrack_helper

 

To make it permanent if you have enabled /etc/rc.local legacy one single file boot place as I do on servers – for how to enable rc.local on newer Linuxes check here

or alternatively add it to load via sysctl

sysctl -w net.netfilter.nf_conntrack_helper=1

And to make change permanent (e.g. be loaded on next boot)

echo 'net.netfilter.nf_conntrack_helper=1' >> /etc/sysctl.conf

 

5. Enable PassivePorts in ProFTPD or PassivePortRange in PureFTPD


Last but not least open /etc/proftpd/proftpd.conf find PassivePorts config value (commented by default) and besides it add the following line:

 

PassivePorts 60000 65534

 

Just for information if instead of ProFTPd you experience the error on PureFTPD the configuration value to set in /etc/pure-ftpd.conf is:
 

PassivePortRange 30000 35000


That's all folks, give the ncftp / lftp / filezilla or whatever FTP client you prefer and test it the FTP client should be able to talk as expected to remote server in ACTIVE FTP mode (and the auto passive mode) will be not triggered anymore, nor you will get a strange errors and failure to connect in FTP clients as gftp.

Cheers 🙂

Why du and df reporting different on a filesystem / How to fix inconsistency between used space on FS and disk showing full strangeness

Wednesday, July 24th, 2019

linux-why-du-and-df-shows-different-result-inconsincy-explained-filesystem-full-oddity

If you're a sysadmin on a large server environment such as a couple of hundred of Virtual Machines running Linux OS on either physical host or OpenXen / VmWare hosted guest Virtual Machine, you might end up sometimes at an odd case where some mounted partition mount point reports its file use different when checked with
df
cmd than when checked with du command, like for example:
 

root@sqlserver:~# df -hT /var/lib/mysql
Filesystem   Type  Size Used Avail Use% Mounted On
/dev/sdb5      ext4    19G  3,4G    14G  20% /var/lib/mysql

Here the '-T' argument is used to show us the filesystem.

root@sqlserver:~# du -hsc /var/lib/mysql
0K    /var/lib/mysql/
0K    total

 

1. Simple debug on what might be the root cause for df / du inconsistency reporting

 

Of course the basic thing to do when in that weird situation is to be totally shocked how this is possible and to investigate a bit what is the biggest first level sub-directories that eat up the space on the mounted location, with du:

 

# du -hkx –max-depth=1 /var/lib/mysql/|uniq|sort -n
4       /var/lib/mysql/test
8       /var/lib/mysql/ezmlm
8       /var/lib/mysql/micropcfreak
8       /var/lib/mysql/performance_schema
12      /var/lib/mysql/mysqltmp
24      /var/lib/mysql/speedtest
64      /var/lib/mysql/yourls
144     /var/lib/mysql/narf
320     /var/lib/mysql/webchat_plus
424     /var/lib/mysql/goodfaithair
528     /var/lib/mysql/moonman
648     /var/lib/mysql/daniel
852     /var/lib/mysql/lessn
1292    /var/lib/mysql/gallery

The given output is in Kilobytes so it is a little bit hard to read, if you're used to Mbytes instead, do

 

 # du -hmx –max-depth=1 /var/lib/mysql/|uniq|sort -n|less

 

I've also investigated on the complete /var directory contents sorted by size with:

 

 # du -akx ./ | sort -n
5152564    ./cache/rsnapshot/hourly.2/localhost
5255788    ./cache/rsnapshot/hourly.2
5287912    ./cache/rsnapshot
7192152    ./cache


Even after finding out the bottleneck dirs and trying to clear up a bit, continued facing that inconsistently shown in two commands and if you're likely to be stunned like me and try … to move some files to a different filesystem to free up space or assigned inodes with a hope that shown inconsitency output will be fixed as it might be caused  due to some kernel / FS caching ?? and this will eventually make the mounted FS to refresh …

But unfortunately, if you try it you'll figure out clearing up a couple of Megas or Gigas will make no difference in cmd output.

In my exact case /var/lib/mysql is a separate mounted ext4 filesystem, however same issue was present also on a Network Filesystem (NFS) and thus, my first thought that this is caused by a network failure problem or NFS bug turned to be wrong.

After further short investigation on the inodes on the Filesystem, it was clear enough inodes are available:
 

# df -i /var/lib/mysql
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
/dev/sdb5      1221600  2562 1219038   1% /var/lib/mysql

 

So the filled inodes count assumed issue also has been rejected.
P.S. (if you're not well familiar with them read manual, i.e. – man 7 inode).
 

– Remounting the mounted filesystem

To make sure the filesystem shown inconsistency between du and df is not due to some hanging network mount or bug, first logical thing I did is to remount the filesytem showing different in size, in my case this was done with:
 

# mount -o remount,rw -t ext4 /var/lib/mysql

For machines with NFS remote mounted storage locations, used:

# mount -o remount,rw -t nfs /var/www


FS remount did not solved it so I continued to ponder what oddity and of course I thought of a workaround (in case if this issues are caused by kernel bug or OS lib issue) reboot might be the solution, however unfortunately restarting the VMs was not a wanted easy to do solution, thus I continued investigating what is wrong …

Next check of course was to check, what kind of network connections are opened to the affected hosts with:
 

# netstat -tupanl


Did not found anything that might point me to the reported different Megabytes issue, so next step was to check what is the situation with currently opened files by running processes on the weird df / du reported systems with lsof, and boom there I observed oddity such as multiple files

 

# lsof -nP | grep '(deleted)'

COMMAND   PID   USER   FD   TYPE DEVICE    SIZE NLINK  NODE NAME
mysqld   2588  mysql    4u   REG 253,17      52     0  1495 /var/lib/mysql/tmp/ibY0cXCd (deleted)
mysqld   2588  mysql    5u   REG 253,17    1048     0  1496 /var/lib/mysql/tmp/ibOrELhG (deleted)
mysqld   2588  mysql    6u   REG 253,17       777884290     0  1497 /var/lib/mysql/tmp/ibmDFAW8 (deleted)
mysqld   2588  mysql    7u   REG 253,17       123667875     0 11387 /var/lib/mysql/tmp/ib2CSACB (deleted)
mysqld   2588  mysql   11u   REG 253,17       123852406     0 11388 /var/lib/mysql/tmp/ibQpoZ94 (deleted)

 

Notice that There were plenty of '(deleted)' STATE files shown in memory an overall of 438:

 

# lsof -nP | grep '(deleted)' |wc -l
438


As I've learned a bit online about the problem, I found it is also possible to find deleted unlinked files only without any greps (to list all deleted files in memory files with lsof args only):

 

# lsof +L1|less


The SIZE field (fourth column)  shows a number of files that are really hard in size and that are kept in open on filesystem and in memory, totally messing up with the filesystem. In my case this is temp files created by MYSQLD daemon but depending on the server provided service this might be apache's www-data, some custom perl / bash script executed via a cron job, stalled rsync jobs etc.
 

2. Check all the list open files with the mysql / root user as part of the the server filesystem inconsistency debugging with:

 

– Grep opened files on server by user

# lsof |grep mysql
mysqld    1312                       mysql  cwd       DIR               8,21       4096          2 /var/lib/mysql
mysqld    1312                       mysql  rtd       DIR                8,1       4096          2 /
mysqld    1312                       mysql  txt       REG                8,1   20336792   23805048 /usr/sbin/mysqld
mysqld    1312                       mysql  mem       REG               8,21      24576         20 /var/lib/mysql/tc.log
mysqld    1312                       mysql  DEL       REG               0,16                 29467 /[aio]
mysqld    1312                       mysql  mem       REG                8,1      55792   14886933 /lib/x86_64-linux-gnu/libnss_files-2.28.so

 

# lsof | grep root
COMMAND    PID   TID TASKCMD          USER   FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
systemd      1                        root  cwd       DIR                8,1       4096          2 /
systemd      1                        root  rtd       DIR                8,1       4096          2 /
systemd      1                        root  txt       REG                8,1    1489208   14928891 /lib/systemd/systemd
systemd      1                        root  mem       REG                8,1    1579448   14886924 /lib/x86_64-linux-gnu/libm-2.28.so

Other command that helped to track the discrepancy between df and du different file usage on FS is:
 

# du -hxa  / | egrep '^[[:digit:]]{1,1}G[[:space:]]*'
 

 

3. Fixing large files kept in memory filesystem problem


What is the real reason for ending up with this file handlers opened by running backgrounded programs on the Linux OS?
It could be multiple  but most likely it is due to exceeded server / client interactions or breaking up RAM or HDD drive with writing plenty of logs on the FS without ending keeping space occupied or Programming library bugs used by hanged service leaving the FH opened on storage.

What is the solution to file system files left in memory problem?

The best solution is to first fix custom script or hanged service and then if possible to simply restart the server to make the kernel / services reload or if this is not possible just restart the problem creation processes.

Once the process is identified like in my case this was MySQL on systemd enabled newer OS distros, just do:

 

 

# systemctl restart mysqld.service


or on older init.d system V ones:

# /etc/init.d/service restart


For custom hanged scripts being listed in ps axuwef you can grep the pid and do a kill -HUP (if the script is written in a good way to recognize -HUP and restart the sub-running process properly – BE EXTRA CAREFUL IF YOU'RE RESTARTING BROKEN SCRIPTS as this might cause your running service disruptions …).

# pgrep -l script.sh
7977 script.sh


# kill -HUP PID

 

Now finally this should either mitigate or at best case completely solve the reported disagreement between df and du, after which the calculated / reported disk space should be back to normal and show up approximately the same (note that size changes a bit as mysql service is writting data) constantly extending the size between the two checks.

 

# df -hk /var/lib/mysql; du -hskc /var/lib/mysql
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
/dev/sdb5        19097172 3472744 14631296  20% /var/lib/mysql
3427772    /var/lib/mysql
3427772    total

 

What we learned?

What I've explained in this article is why and how it comes that 'zoombie' files reside on a filesystem
appearing to be eating disk space on a mounted local or network partition, giving strange inconsistent
reports, leading to system service disruptions and impossibility to have correctly shown information on used
disk space on mounted drive.

I went through with some standard logic on debugging service / filesystem / inode issues up explainat, that led me to the finding about deleted files being kept in filesystem and producing the filesystem strange sized / showing not correct / filled even after it was extended with tune2fs and was supposed to have extra 50GBs.

Finally it was explained shortly how to HUP / restart hanging script / service to fix it.

Some few good readings that helped to fix the issue:

What to do when du and df report different usage is here
df in linux not showing correct free space after file removal is here
Why do “df” and “du” commands show different disk usage?
 

Automatic network restart and reboot Linux server script if ping timeout to gateway is not responding as a way to reduce connectivity downtimes

Monday, December 10th, 2018

automatic-server-network-restart-and-reboot-script-if-connection-to-server-gateway-inavailable-tux-penguing-ascii-art-bin-bash

Inability of server to come back online server automaticallyafter electricity / network outage

These days my home server  is experiencing a lot of issues due to Electricity Power Outages, a construction dig operations to fix / change waterpipe tubes near my home are in action and perhaps the power cables got ruptered by the digger machine.
The effect of all this was that my server networking accessability was affected and as I didn't have network I couldn't access it remotely anymore at a certain point the electricity was restored (and the UPS charge could keep the server up), however the server accessibility did not due restore until I asked a relative to restart it or under a more complicated cases where Tech aquanted guy has to help – Alexander (Alex) a close friend from school years check his old site here – alex.www.pc-freak.net helps a lot.to restart the machine physically either run a quick restoration commands on root TTY terminal or generally do check whether default router is reachable.

This kind of Pc-Freak.net downtime issues over the last month become too frequent (the machine was down about 5 times for 2 to 5 hours and this was too much (and weirdly enough it was not accessible from the internet even after electricity network was restored and the only solution to that was a physical server restart (from the Power Button).

To decrease the number of cases in which known relatives or friends has to  physically go to the server and restart it, each time after network or electricity outage I wrote a small script to check accessibility towards Default defined Network Gateway for my server with few ICMP packages sent with good old PING command
and trigger a network restart and system reboot
(in case if the network restart does fail) in a row.

1. Create reboot-if-nwork-is-downsh script under /usr/sbin or other dir

Here is the script itself:

 

#!/bin/sh
# Script checks with ping 5 ICMP pings 10 times to DEF GW and if so
# triggers networking restart /etc/inid.d/networking restart
# Then does another 5 x 10 PINGS and if ping command returns errors,
# Reboots machine
# This script is useful if you run home router with Linux and you have
# electricity outages and machine doesn't go up if not rebooted in that case

GATEWAY_HOST='192.168.0.1';

run_ping () {
for i in $(seq 1 10); do
    ping -c 5 $GATEWAY_HOST
done

}

reboot_f () {
if [ $? -eq 0 ]; then
        echo "$(date "+%Y-%m-%d %H:%M:%S") Ping to $GATEWAY_HOST OK" >> /var/log/reboot.log
    else
    /etc/init.d/networking restart
        echo "$(date "+%Y-%m-%d %H:%M:%S") Restarted Network Interfaces:" >> /tmp/rebooted.txt
    for i in $(seq 1 10); do ping -c 5 $GATEWAY_HOST; done
    if [ $? -eq 0 ] && [ $(cat /tmp/rebooted.txt) -lt ‘5’ ]; then
         echo "$(date "+%Y-%m-%d %H:%M:%S") Ping to $GATEWAY_HOST FAILED !!! REBOOTING." >> /var/log/reboot.log
        /sbin/reboot

    # increment 5 times until stop
    [[ -f /tmp/rebooted.txt ]] || echo 0 > /tmp/rebooted.txt
    n=$(< /tmp/rebooted.txt)
        echo $(( n + 1 )) > /tmp/rebooted.txt
    fi
    # if 5 times rebooted sleep 30 mins and reset counter
    if [ $(cat /tmprebooted.txt) -eq ‘5’ ]; then
    sleep 1800
        cat /dev/null > /tmp/rebooted.txt
    fi
fi

}
run_ping;
reboot_f;

You can download a copy of reboot-if-nwork-is-down.sh script here.

As you see in script successful runs  as well as its failures are logged on server in /var/log/reboot.log with respective timestamp.
Also a counter to 5 is kept in /tmp/rebooted.txt, incremented on each and every script run (rebooting) if, the 5 times increment is matched

a sleep is executed for 30 minutes and the counter is being restarted.
The counter check to 5 guarantees the server will not get restarted if access to Gateway is not continuing for a long time to prevent the system is not being restarted like crazy all time.
 

2. Create a cron job to run reboot-if-nwork-is-down.sh every 15 minutes or so 

I've set the script to re-run in a scheduled (root user) cron job every 15 minutes with following  job:

To add the script to the existing cron rules without rewriting my old cron jobs and without tempering to use cronta -u root -e (e.g. do the cron job add in a non-interactive mode with a single bash script one liner had to run following command:

 

{ crontab -l; echo "*/15 * * * * /usr/sbin/reboot-if-nwork-is-down.sh 2>&1 >/dev/null; } | crontab –


I know restarting a server to restore accessibility is a stupid practice but for home-use or small client servers with unguaranteed networks with a cheap Uninterruptable Power Supply (UPS) devices it is useful.

Summary

Time will show how efficient such a  "self-healing script practice is.
Even though I'm pretty sure that even in a Corporate businesses and large Public / Private Hybrid Clouds where access to remote mounted NFS / XFS / ZFS filesystems are failing a modifications of the script could save you a lot of nerves and troubles and unhappy customers / managers screaming at you on the phone 🙂


I'll be interested to hear from others who have a better  ideas to restore ( resurrect ) access to inessible Linux server after an outage.?
 

MySQL crashes after upgrade from MySQL to MariaDB and how to fix it

Tuesday, August 21st, 2018

how-to-fix-crashing-mysql-after-upgrade-to-mariadb-database-mariadb-logo.png

If you have recently upgraded your Debian / Ubuntu / CentOS Linux Server to the latest RPM / DEB packages as part of the upgrade you might have noticed the upgrade of MySQL Community Server  (which was bought by Oracle Corporation few years ago) is automatically upgraded to MariaDB (which is a MySQL fork made by the original developers of MySQL and guaranteed to stay open source. Just to name some of the Notable users include Wikipedia, WordPress.com and Google.).

You might have noticed MariaDB's restart script which is still under /etc/init.d/mysql  won't start and a quick check in /var/log/mysql.err | /var/log/mysql.log
shows errors of /usr/bin/mysqld crashing with errors like:

140502 14:13:05 [Note] Plugin 'FEDERATED' is disabled.
InnoDB: Log scan progressed past the checkpoint lsn 108 1057948207
140502 14:13:06  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files…
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer…
InnoDB: Doing recovery: scanned up to log sequence number 108 1058059648
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 15 row operations to undo
InnoDB: Trx id counter is 0 562485504
140502 14:13:06  InnoDB: Starting an apply batch of log records to the database…
InnoDB: Progress in percents: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Starting in background the rollback of uncommitted transactions
140502 14:13:06  InnoDB: Rolling back trx with id 0 562485192, 15 rows to undo
140502 14:13:06  InnoDB: Started; log sequence number 108 1058059648
140502 14:13:06  InnoDB: Assertion failure in thread 1873206128 in file ../../../storage/innobase/fsp/fsp0fsp.c line 1593
InnoDB: Failing assertion: frag_n_used > 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
InnoDB: about forcing recovery.
140502 14:13:06 – mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=0
max_threads=151
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 345919 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = (nil) thread_stack 0x30000
140502 14:13:06 [Note] Event Scheduler: Loaded 0 events
140502 14:13:06 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.1.41-3ubuntu12.10'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Ubuntu)
/usr/sbin/mysqld(my_print_stacktrace+0x2d) [0xb7579cbd]
/usr/sbin/mysqld(handle_segfault+0x494) [0xb7245854]
[0xb6fc0400]
/lib/tls/i686/cmov/libc.so.6(abort+0x182) [0xb6cc5a82]
/usr/sbin/mysqld(+0x4867e9) [0xb74647e9]
/usr/sbin/mysqld(btr_page_free_low+0x122) [0xb74f1622]
/usr/sbin/mysqld(btr_compress+0x684) [0xb74f4ca4]
/usr/sbin/mysqld(btr_cur_compress_if_useful+0xe7) [0xb74284e7]
/usr/sbin/mysqld(btr_cur_pessimistic_delete+0x332) [0xb7429e72]
/usr/sbin/mysqld(btr_node_ptr_delete+0x82) [0xb74f4012]
/usr/sbin/mysqld(btr_discard_page+0x175) [0xb74f41e5]
/usr/sbin/mysqld(btr_cur_pessimistic_delete+0x3e8) [0xb7429f28]
/usr/sbin/mysqld(+0x526197) [0xb7504197]
/usr/sbin/mysqld(row_undo_ins+0x1b1) [0xb7504771]
/usr/sbin/mysqld(row_undo_step+0x25f) [0xb74c210f]
/usr/sbin/mysqld(que_run_threads+0x58a) [0xb74a31da]

/usr/sbin/mysqld(trx_rollback_or_clean_all_without_sess+0x3e3) [0xb74ded43]
/lib/tls/i686/cmov/libpthread.so.0(+0x596e) [0xb6f9f96e]
/lib/tls/i686/cmov/libc.so.6(clone+0x5e) [0xb6d65a4e]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Any recommendations?
mysql

I hoped to solve the /usr/bin/mysqld segfault error with server reboot as I though the problem is caused by the fact libc library was updated, but even a reboot did not solve it.

I've investigated online for a solution and found following MySQL corruption and recovery article.

The solution outlined there is very simple and comes to adding the line:
 

innodb_force_recovery = 1


to /etc/mysql/my.cnf

Assuming the mysql server is not running before restarting mariadb server.

1. Make a backup (Dump) of all MySQL tables

mysql:~# mysqldump -A > dump.sql

2. Drop all databases which need recovery.
You can do that from mysql cli or phpmyadmin

3. Stop mysqld.

mysql:~# /etc/init.d/mysql restart

4.  Remove /var/lib/mysql/ib*

mysql:~# rm -rf /var/lib/mysql/ib*

5. Comment out innodb_force_recovery in /etc/mysql/my.cnf

6. Restart mysqld. Look at mysql error log.
If everything is fine and you have problems with broken or missing databases the best thing next is to stop again mariadb and

7. Restore databases from the dump

mysql:~# mysql < dump.sql

 

 

 

Fixing Disappeared intel net link 5100 wifi on Toshiba Satellite L300 1YA

Friday, October 29th, 2010

I've recently had to fix a Toshiba Satellite L300 1YA notebook , running the shitty Windows Vista operating system.
For some reason suddenly the Net Link 5100 Wi-FI Network Adapter of the laptop mysteriously disappeared.
The led indicating the device is enabled was blinking the driver showed properly installed in Windows -> System and everything.
Even though that the wireless network scanning in the notebook was not working.

In networking in System there were two lines showing up with excalammation mark "!".

To fix the mess it took me 3.5 hours. I tried many things first logical thing I tried was reinstalling the driver with the latest available from Intel's website. Anyways this doesn't helped so I was about to think about other solutions.
What made thinks even worse was that the Vista installed was in Chineese!, yes you red this correctly chineese. You cannot imagine what a hell it is to deal with Windows whose language pack was Chineese …

The Vista had installed also the Chineese antivirus program Rising which also provided the system with some weird firewall and this made the dealing with the problem even more complicated.

After many tries I finally completely removed the wireless driver on the system and reinstalled it with the latest version from Intel's website.
Thanksfully after a couple of reboots and going into save mode the Intel Net Link 5100 started working again by itself?!

Well you know how things goes with Windows, you never know what will happen next.

Maybe a lot of notebooks suffer the same weird issue with Wireless Wi-Fi adapter suddenly stopping to work.

So now you know the solution, remove the driver install it again, restart and it should be working again.
Hope this quick and dirty article will save somebody an hours of time to figure it out …