How to delete millions of files on busy Linux servers (working around "Argument list too long")
If you try to delete a huge number of files (more than about 131072, though the exact count depends on the total length of the expanded file names) on Linux with rm -f *, where the files are all stored in the same directory, you will get the error:
/bin/rm: Argument list too long.
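The error comes from the kernel's limit on the total size of the argument list handed to a program: the shell expands * to every file name in the directory before rm even starts, and the expansion overflows that limit. You can check the limit in bytes on your own system with getconf:
# getconf ARG_MAX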
I've earlier blogged on deleting multiple files on Linux and FreeBSD, and this is not my first time facing this error. As time passed, I've found a few other new ways to delete large multitudes of files from a server.
In this article, I will briefly explain a few approaches to deleting a few million obsolete files to clean some space on your server.
Here are several methods you can use to clean tons of junk files from your server.
1. Using the Linux find command to wipe out millions of files
a.) Finding and deleting files using find's -exec
switch:
# find . -type f -exec rm -fv {} \;
This method works fine, but it has one downside: deletion is slow, because a separate external rm process is invoked for each found file.
For half a million files or more, this method will take "ages". However, from the point of view of stressing the server's hard disk it is not so bad, as the deletion does not put too much strain on the disk.
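A middle ground, assuming your find supports the POSIX + terminator to -exec (GNU find does), is to batch as many file names as fit into each rm invocation instead of spawning one rm per file:
# find . -type f -exec rm -f {} +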
b.) Finding and deleting a big number of files with find's -delete argument:
Luckily, there is a better way to delete the files, by using find's built-in -delete argument:
# find . -type f -delete
c.) Deleting and printing out the files to be deleted with find's -print arg
If you would like to output what find is deleting in "real time" on your terminal:
# find . -type f -print -delete
If you want to prevent your server's hard disk from being stressed, and hence avoid "outages" in normal server operation, it is good to combine the find command with ionice, e.g.:
# ionice -c 3 find . -type f -print -delete
Just note that ionice cannot guarantee the hard disk will not be put under heavy load by find. On some heavily busy servers with high amounts of disk I/O writes, applying ionice still will not prevent the server from hanging! Be sure to always keep an eye on the server while deleting the files, no matter whether with or without ionice; if it gets lagged in serving its ordinary client requests or whatever, stop the execution of the command.
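If you also want to lower the CPU priority of the run, nice can be stacked on top of ionice; a small sketch of the combined invocation:
# ionice -c 3 nice -n 19 find . -type f -delete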
2. Using a simple bash loop with the rm command to delete "tons" of files
An alternative way is to use a bash loop to go over each of the files in the directory and issue /bin/rm on each of the loop elements (files), like so:
for i in *; do
rm -f "$i";
done
Note that quoting "$i" protects the loop against file names containing spaces.
If you'd like to print what you will be deleting, add an echo to the loop:
# for i in *; do \
echo "Deleting : $i"; rm -f "$i"; \
done
The bash loop worked like a charm in my case, so I really warmly recommend this method if you need to delete more than 500 000+ files in a directory.
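If even the plain loop stresses the disk too much on a loaded machine, a throttled variant is possible. This is just a sketch I have not benchmarked, which sleeps for a second after every 1000 deletions:
c=0
for i in *; do
  rm -f "$i"
  c=$((c + 1))
  # pause briefly every 1000 deletions to give the disk some air
  [ "$((c % 1000))" -eq 0 ] && sleep 1
done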
3. Deleting multiple files with perl
Deleting multiple files with perl is not a bad idea at all.
Here is a perl one-liner to delete all files contained within a directory:
noah:~# perl -e 'for(<*>){((stat)[9]<(unlink))}'
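The (stat)[9]<(unlink) body is a compact (and somewhat obfuscated) way to stat and then unlink each file matched by the <*> glob. If you only need the deletion, a plainer equivalent one-liner (my simplification, not the exact command benchmarked below) would be:
noah:~# perl -e 'unlink for <*>'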
If you prefer a more human-readable perl script to delete a multitude of files, use delete_multple_files_in_dir_perl.pl
Using perl to delete thousands of files is really quick. I did not benchmark on the server exactly how quick it is, but I guess the delete rate should be similar to the find command. It's possible that in some cases the perl loop is even quicker ...
4. Using a PHP script to delete multiple files
Using a short PHP script to delete files in a loop, similar to the above bash loop, is also possible.
To do it with PHP, use this little PHP script:
<?php
// directory to purge; change this to your own path
$dir = "/path/to/dir/with/files";
$dh = opendir($dir);
$i = 0;
while (($file = readdir($dh)) !== false) {
    $file = "$dir/$file";
    // skips "." and ".." and anything that is not a regular file
    if (is_file($file)) {
        unlink($file);
        // print progress every 1000 deleted files
        if (!(++$i % 1000)) {
            echo "$i files removed\n";
        }
    }
}
closedir($dh);
?>
As you see, the script reads the directory defined in $dir and loops through its entries, unlinking each regular file it finds.
You should already know PHP is slow, so this method is only useful if you have to delete many thousands of files on a shared hosting server with no (ssh) shell access.
The above PHP script is taken from Steve Kamerman's blog. I would also like to express my big gratitude to Steve for writing such a wonderful post and being the inspiration for this article.
You can also download the PHP delete-million-of-files script sample here.
To use it, rename delete_millioon_of_files_in_a_dir.php.txt to delete_millioon_of_files_in_a_dir.php and run it through a browser.
Note that you might need to run it multiple times, because many shared hosting servers are configured to kill a PHP script which keeps running for too long.
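If your host permits it, you can also try lifting the execution time limit from inside the script itself by adding a set_time_limit() call near the top; many shared hosts override or ignore this setting, so treat it as a best-effort tweak:
<?php
// ask PHP to impose no execution time limit on this script;
// shared hosts may ignore or override this setting
set_time_limit(0);
?>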
Alternatively, the script can be run through the shell with the PHP CLI:
php delete_millioon_of_files_in_a_dir.php.txt
(Note that php -l would only syntax-check the script, not execute it.)
5. So what is the "best" way to delete a million files on Linux?
In order to find out which method is quicker in terms of execution time, I did some home-brew benchmarking on my ThinkPad notebook.
a) Creating 509072 sample files
I used a shell for loop to create the files for the sake of the benchmark. I didn't want to put this load on a production server, and hence I used my own notebook to conduct the benchmarks. As my notebook is not a server, the benchmarks might be partially inaccurate; however, I believe they're still a pretty good indicator of which deletion method is better.
hipo@noah:~$ mkdir /tmp/test
hipo@noah:~$ cd /tmp/test
hipo@noah:/tmp/test$ for i in $(seq 1 509072); do echo aaaa >> $i.txt; done
I had to wait a few minutes until I had at hand 509072 files containing the sample "aaaa" string.
b) Calculating the number of files in the directory
Once the command completed, to make sure all 509072 files existed, I counted the files in the directory:
hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f | wc -l
509072

real 0m1.886s
user 0m0.440s
sys 0m1.332s
It's interesting that using the ls command to count the files is less efficient than using find:
hipo@noah:/tmp/test$ time ls -1 | wc -l
509072

real 0m3.355s
user 0m2.696s
sys 0m0.528s
c) Benchmarking the different file deletion methods with time
- Testing the delete speed of find
hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f -delete

real 15m40.853s
user 0m0.908s
sys 0m22.357s
You see, using find to delete the files is really, really slow.
- How fast is the perl loop at multitude file deletion?
hipo@noah:/tmp/test$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'

real 6m24.669s
user 0m2.980s
sys 0m22.673s
Deleting my 509072 sample files took 6 mins and 24 secs. This is almost two and a half times faster than find! GO-GO perl :)
As you can see from the results, perl is a great and time-saving way to delete 500 000 files.
- The approximate deletion rate of the for + rm bash loop
hipo@noah:/tmp/test$ time for i in *; do rm -f $i; done

real 206m15.081s
user 2m38.954s
sys 195m38.182s
You see, the execution took 3 HOURS and 26 MINUTES!!!! This is extremely slow! But it works like a charm, as running the deletion didn't impact my normal laptop browsing (of a few not-so-heavy websites, and doing some stuff in gnome-terminal) :)
As you can imagine, running a bash loop is a bit CPU intensive, but it puts less stress on the hard disk's read/write operations. Therefore it is well suited to deleting many files on dedicated servers.
d) My production server file deleting experience
On a production server, I only tested two of the listed methods. The production server where I tested is running Debian GNU/Linux Squeeze. I had to delete a few million files.
The tested methods there were:
i. The find . -type f -delete method.
ii. for i in *; do rm -f $i; done
The results from the find -delete method were quite sad, as the server almost hanged under the heavy hard disk load the command produced.
With the for loop all went smoothly. The files were deleted over a longer time, but the server continued operating with no interruptions.
While the bash loop was running, the server load average kept steady at 4.
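To keep an eye on the server while such a deletion runs (as advised earlier), watching the load average from a second terminal is usually enough; iostat, from the sysstat package if you have it installed, also shows per-disk utilization:
# watch -n 5 uptime
# iostat -x 5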
If you're running a production server and you're still wondering which method to use to delete some multitude of files, I would recommend you go the bash for + rm way. It is extremely slow, so expect it to run for half an hour or more, but it doesn't put too much extra load on the server.
Using the PHP script will probably be slow and inefficient compared to both find and the bash loop. I haven't given it a try yet, but I suppose it will be either equal in time or at least a few times slower than bash.
If you have tried the PHP script and have some observations, please drop a comment to tell me how it performs.
To sum it up: even though there are "hacks" to clean up a messy directory full of a few million junk files, such a directory should never exist in the first place.
Keeping millions of files within the same directory is a very stupid idea.
Doing so will have a severe negative impact on your filesystem's directory listing performance in the long term.
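If you control the application that writes these files, a common mitigation is to fan the files out into hashed subdirectories so that no single directory ever grows huge. A rough shell sketch of the idea (the storage path here is made up for illustration):
name="somefile.txt"
# use the first two hex chars of the name's md5 as a bucket directory
bucket=$(printf '%s' "$name" | md5sum | cut -c1-2)
mkdir -p "/srv/storage/$bucket"
mv "$name" "/srv/storage/$bucket/"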
If you know better (more efficient) ways to delete a multitude of files in a dir, please share them in the comments.