
If you try to delete more than about 131072 files on Linux with rm -f *, where the files are all stored in the same directory, you will get an error:
/bin/rm: Argument list too long.
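Strictly speaking, the threshold is not a fixed file count but the kernel's ARG_MAX limit on the total size of the argument list the shell expansion produces. If you are curious what the limit is on your system, you can check it with getconf:
# getconf ARG_MAX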
I've blogged earlier about deleting multiple files on Linux and FreeBSD, and this is not my first time facing this error.
Anyways, as time passed, I've found a few other new ways to delete large multitudes of files from a server.
In this article, I will briefly explain a few approaches to deleting a few million obsolete files in order to free some space on your server.
Here are several methods you can use to clean up your tons of junk files.
1. Using Linux find command to wipe out millions of files
a.) Finding and deleting files using find's -exec switch:
# find . -type f -exec rm -fv {} \;
This method works fine but it has one downside: file deletion is slow, because an external rm command is invoked for each found file.
For half a million files or more, using this method will take "long". However, from a hard disk stressing point of view it is not so bad, as the deletion does not put too much strain on the server hard disk.
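If your find supports the POSIX + terminator for -exec (GNU find does), a less wasteful variant passes many file names to each rm invocation instead of spawning one rm per file; a minimal sketch:
# find . -type f -exec rm -f {} +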
b.) Finding and deleting a big number of files with find's -delete argument:
Luckily, there is a better way to delete the files, by using find's built-in -delete argument:
# find . -type f -print -delete
c.) Deleting and printing out deleted files with find's -print arg
If you would like to see on your terminal, in "real time", which files find is deleting, add -print:
# find . -type f -print -delete
To prevent your server hard disk from being stressed, and hence save yourself from "outages" in normal server operation, it is good to combine the find command with ionice, e.g.:
# ionice -c 3 find . -type f -print -delete
Just note that ionice cannot guarantee find's operations will not severely affect hard disk i/o requests. On heavily busy servers with high amounts of disk i/o writes, applying ionice will still not prevent the server from hanging! Be sure to always keep an eye on the server while deleting the files, no matter whether with or without ionice. If, during find's execution, the server starts lagging in serving its ordinary client requests, stop the command immediately by killing it from another ssh session or tty (if physically on the server).
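If you also want to lower the deletion's CPU scheduling priority, and not just its disk i/o priority, ionice can be combined with nice; a sketch:
# nice -n 19 ionice -c 3 find . -type f -print -delete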
2. Using a simple bash loop with rm command to delete "tons" of files
An alternative way is to use a bash loop to iterate over each of the files in the directory and issue /bin/rm on each loop element (file) like so:
for i in *; do
rm -f "$i";
done
If you'd like to print what you are about to delete, add an echo to the loop:
# for i in *; do \
echo "Deleting : $i"; rm -f "$i"; \
done
The bash loop worked like a charm in my case, so I really warmly recommend this method whenever you need to delete more than 500 000+ files in a directory.
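Another variant, which I have not benchmarked here but which also avoids the "Argument list too long" error, is piping find's output to xargs so the removals are issued in batches:
# find . -maxdepth 1 -type f -print0 | xargs -0 rm -f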
3. Deleting multiple files with perl
Deleting multiple files with perl is not a bad idea at all.
Here is a perl one-liner to delete all files contained within a directory (the stat/comparison is just filler to keep it short; the real work is done by unlink, which is called once per file):
# perl -e 'for(<*>){((stat)[9]<(unlink))}'
If you prefer to use a more human-readable perl script to delete a multitude of files, use delete_multple_files_in_dir_perl.pl
Using the perl interpreter to delete thousands of files is quick, really, really quick.
I did not benchmark on the server exactly how quick it is, but I guess the delete rate should be similar to the find command. It's possible the perl loop is even quicker in some cases …
4. Using a PHP script to delete multiple files
Using a short php script that deletes the files one by one in a loop, similar to the bash script above, is another option.
To do the deletion with PHP, use this little PHP script:
<?php
// Directory to clean up - adjust the path to your own case
$dir = "/path/to/dir/with/files";
$dh = opendir($dir);
$i = 0;
// Walk through the directory entries and unlink every regular file
while (($file = readdir($dh)) !== false) {
    $file = "$dir/$file";
    if (is_file($file)) {
        unlink($file);
        // Print a progress line on every 1000 removed files
        if (!(++$i % 1000)) {
            echo "$i files removed\n";
        }
    }
}
closedir($dh);
?>
As you see, the script opens the directory defined in $dir and loops through its entries, issuing a delete for each regular file it finds.
You should already know PHP is slow, so this method is only useful if you have to delete many thousands of files on a shared hosting server with no (ssh) shell access.
This php script is taken from Steve Kamerman's blog. I would also like to express my big gratitude to Steve for writing such a wonderful post. His post actually became the inspiration for this article.
You can also download the php delete million of files script sample here
To use it, rename delete_millioon_of_files_in_a_dir.php.txt to delete_millioon_of_files_in_a_dir.php and run it through a browser.
Note that you might need to run it multiple times, because many shared hosting servers are configured to kill a php script which keeps running for too long.
Alternatively the script can be run through the shell with the PHP CLI:
php delete_millioon_of_files_in_a_dir.php
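By the way, php's -l (lint) switch only checks a script for syntax errors without executing it, so it can serve as a quick pre-flight check before running the deletion:
php -l delete_millioon_of_files_in_a_dir.php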
5. So what is the "best" way to delete a million files on Linux?
In order to find out which method is quicker in terms of execution time, I did a home-brew benchmark on my ThinkPad notebook.
a) Creating 509072 sample files.
Again, I used a bash loop to create many thousands of files for the benchmark.
I didn't want to put this load on a production server, hence I used my own notebook to conduct the benchmarks. As my notebook is not a server, the benchmarks might be partially inaccurate, but I still believe they're a pretty good indicator of which deletion method would be better.
hipo@noah:~$ mkdir /tmp/test
hipo@noah:~$ cd /tmp/test;
hipo@noah:/tmp/test$ for i in $(seq 1 509072); do echo aaaa >> $i.txt; done
I had to wait a few minutes until the 509072 files were created. Each of the files, as you can see, contains the sample "aaaa" string.
b) Calculating the number of files in the directory
Once the command completed, to make sure all 509072 files existed, I used a find + wc cmd to count the number of files in the directory:
hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f |wc -l
509072
real 0m1.886s
user 0m0.440s
sys 0m1.332s
It's interesting that using the ls command to count the files is less efficient than using find:
hipo@noah:/tmp/test$ time ls -1 |wc -l
509072
real 0m3.355s
user 0m2.696s
sys 0m0.528s
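As a side note, most of ls's extra time probably goes into sorting the half a million names; ls's -f switch disables sorting (and implies -a, so the . and .. entries get counted too), which should bring it much closer to find's speed:
hipo@noah:/tmp/test$ time ls -f | wc -l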
c) Benchmarking the different file deletion methods with time
– Testing delete speed of find
hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f -delete
real 15m40.853s
user 0m0.908s
sys 0m22.357s
You see, using find to delete the files is neither too slow nor lightning quick.
– How fast is the perl loop at mass file deletion?
hipo@noah:/tmp/test$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'
real 6m24.669s
user 0m2.980s
sys 0m22.673s
Deleting my sample of 509072 files took 6 mins and 24 secs. This is roughly two and a half times faster than find! GO-GO perl 🙂
As you can see from the results, perl is a great, time-saving way to delete 500 000 files.
– The approximate deletion speed of the for + rm bash loop
hipo@noah:/tmp/test$ time for i in *; do rm -f $i; done
real 206m15.081s
user 2m38.954s
sys 195m38.182s
You see, the execution took 206 minutes and 15 seconds = almost 3 HOURS and 26 MINUTES!!!! This is extremely slow! But it works like a charm, as the running deletion didn't impact my normal laptop use. While the script was running I was mostly browsing through a few not-so-heavy (non flash) websites and doing some other stuff in gnome-terminal 🙂
As you can imagine, running a bash loop is a bit CPU intensive, but it puts less stress on the hard disk read/write operations. Therefore it is clearly a good practice to use it whenever many files have to be deleted on a dedicated server.
d) My production server file deleting experience
On a production server I tested only two of the listed methods. The production server where I tested is running Debian GNU / Linux Squeeze 6.0.3. There I had the task of deleting a few million files.
The methods tried on the server were:
– The find . -type f -delete method.
– for i in *; do rm -f $i; done
The results from using the find -delete method were quite sad, as the server almost hung under the heavy hard disk load the command produced.
With the for loop all went smoothly. The files took a long, long time to delete (a few hours), but while the loop was running the server kept serving requests with no interruptions.
While the bash loop was running, the server load average kept at a steady 4.
With my experience in mind, if you're running a production server and you're still wondering which method to use to wipe out a multitude of files, I would recommend you go the bash for loop + /bin/rm way. Yes, it is extremely slow, expect it to run for hours rather than minutes, but it does not put too much extra load on the server.
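If you want to be extra gentle with the disk on a production box, the same ionice trick shown for find above can be wrapped around the bash loop as well; a sketch:
# ionice -c 3 bash -c 'for i in *; do rm -f "$i"; done'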
Using the PHP script will probably be slow and inefficient compared to both find and the bash loop. I haven't given it a try yet, but I suppose it will be roughly equal in time or at least a few times slower than bash.
If you have tried the php script and have some observations, please drop a comment to tell me how it performs.
To sum it up:
Even though there are "hacks" to clean up a messy directory full of a few million junk files, such a directory should never come to exist in the first place.
Frankly, keeping millions of files within the same directory is a very stupid idea.
Doing so will have a severe negative impact on the directory listing performance of your filesystem in the long term.
If you know better (more efficient) ways to delete a multitude of files in a dir please share in comments.
How to resolve (fix) WordPress wp-cron.php errors like "POST /wp-cron.php?doing_wp_cron HTTP/1.0" 404 / What is wp-cron.php and what it does
Monday, March 12th, 2012
One of the WordPress websites hosted on our dedicated server constantly produces wp-cron.php 404 error messages like:
xxx.xxx.xxx.xxx - - [15/Apr/2010:06:32:12 -0600] "POST /wp-cron.php?doing_wp_cron HTTP/1.0" 404
I did not know until recently what wp-cron.php does, so I checked in google and read a bit. Many of the places I've read are a bit unclear and don't give a good explanation of what exactly wp-cron.php does. I wrote this post in the hope it will shed some more light on wp-cron.php and on how this major 404 issue is solved.
So what is wp-cron.php doing?
Suppose you're writing a new post and you want to take advantage of the WordPress functionality to schedule a post to appear online at a specific time:
The scheduled publishing behind the Publish (Immediately) field is executed at the chosen time thanks to the periodic invocation of wp-cron.php.
Another example of wp-cron.php's operation is the flushing of old WP HTML caches generated by a wordpress caching plugin like W3 Total Cache.
wp-cron.php takes care of dozens of other tasks silently in the background. That's why many wordpress plugins depend heavily on wp-cron.php's proper periodic execution. Therefore, if something is wrong with wp-cron.php, it leaves a wordpress based blog or website only partially working or not working at all.
Our company wp-cron.php errors case
In our case the:
212.235.185.131 – – [15/Apr/2010:06:32:12 -0600] "POST /wp-cron.php?doing_wp_cron HTTP/1.0" 404
is occurring in the Apache access.log (after each unique visitor request to wordpress!), because wp-cron.php is invoked on each new visitor request to the site.
This puts a "vain load" on the Apache server, which constantly attempts to invoke the script … always returning a not found 404 err.
As a consequence, the WP website experiences "weird" problems all the time. An illustration of a problem caused by the improper wp-cron.php execution is when we are adding new plugins to WP.
Let's say a new wordpress extension is downloaded, installed and enabled in order to add new useful functionality to the site.
Most of the time this new plugin would malfunction if, for example, it is supposed to add some kind of new html form or change something on some or all of the wordpress generated HTML pages.
These troubles are a result of wp-cron.php's inability to update settings in the wp SQL database after each new user request to our site.
So the newly added plugin functionality does not show up at all, until the WP cache directory is manually deleted with rm -rf /var/www/blog/wp-content/cache/…
I don't know how this whole wp-cron.php mess occurred, however my guess is whoever installed this wordpress messed up something in the install procedure.
Anyways, as I researched thoroughly, I read about many people complaining of having experienced the same wp-cron.php 404 errs. As I read, most of those people's troubles were caused by their shared hosting prohibiting the wp-cron.php execution.
It appears many shared hosting providers choose to disable the default wordpress wp-cron.php execution. The reason is probably that the script puts a heavy load on shared hosting servers and causes troubles with server overloads.
Anyhow, since our company server is a dedicated server, I can tell for sure that in our case wordpress had no restrictions on how and when wp-cron.php is invoked.
I've also seen some posts online claiming the wp-cron.php issues are caused by improper localhost records in /etc/hosts; after a thorough examination I did not find any hosts problems:
hipo@debian:~$ grep -i 127.0.0.1 /etc/hosts
127.0.0.1 localhost.localdomain localhost
You see from the above paste, our server's /etc/hosts has perfectly correct 127.0.0.1 records.
Changing default way wp-cron.php is executed
As I've learned, it is generally a good idea for WordPress based websites which get tens of thousands of visitors to alter the default way wp-cron.php is handled. Doing so will gain some efficiency and improve server hardware utilization.
Invoking the script after each visitor request can put a heavy "useless" burden on the server CPU. On most wordpress based websites, the script does not need to make frequent changes in the DB, as new comments on posts do not happen that often. In most wordpress installs out there, big changes in the wordpress data are not common.
Therefore, a good frequency to exec wp-cron.php for wordpress blogs getting only a couple of user comments per hour is a half-hour cron routine.
To disable the automatic invocation of wp-cron.php after each visitor request, open /var/www/blog/wp-config.php and, around line 30 or 40, put:
define('DISABLE_WP_CRON', true);
An important note to make here is that the position in wp-config.php where define('DISABLE_WP_CRON', true); is placed matters. If for instance you put it at the end of the file or near the end of the file, this setting will not take effect.
With that said, be sure to put the define somewhere among the file's initial defines or it will not work.
Next, as the non-root Apache user - www-data, httpd or www, depending on the Linux distribution or BSD Unix type - add a php CLI cron line to invoke wp-cron.php every half an hour:
linux:~# crontab -u www-data -e
0,30 * * * * cd /var/www/blog; /usr/bin/php /var/www/blog/wp-cron.php >/dev/null 2>&1
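To double-check that the new job landed in the right user's crontab, you can list it afterwards:
linux:~# crontab -u www-data -l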
To make sure the php CLI (Command Line Interface) interpreter is capable of properly interpreting wp-cron.php, check wp-cron.php for syntax errors with the cmd:
linux:~# php -l /var/www/blog/wp-cron.php
No syntax errors detected in /var/www/blog/wp-cron.php
That's all, 404 wp-cron.php error messages will not appear anymore in access.log! 🙂
Just for those who can't find the root cause of the /wp-cron.php?doing_wp_cron HTTP/1.0" 404 and fix the issue in some other way (I'll be glad to know how?), there is also another, external way to invoke wp-cron.php: a request made directly to the webserver via a short cron invocation of wget or the lynx text browser.
– Here is how to call wp-cron.php every half an hour with lynx. Put, inside any non-privileged user's crontab, something like:
01,30 * * * * /usr/bin/lynx -dump "http://www.your-domain-url.com/wp-cron.php?doing_wp_cron" >/dev/null 2>&1
– Call wp-cron.php every 30 mins with wget:
01,30 * * * * /usr/bin/wget -q "http://www.your-domain-url.com/wp-cron.php?doing_wp_cron"
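If neither lynx nor wget is available on the machine making the requests, curl can be used in the same way (a sketch, assuming curl is installed):
01,30 * * * * /usr/bin/curl -s "http://www.your-domain-url.com/wp-cron.php?doing_wp_cron" >/dev/null 2>&1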
Invoking wp-cron.php less frequently saves the server from processing the script thousands of useless times.
The effect of altering the way wp-cron.php works should be visible immediately, as the server load should drop a bit.
Consider that you might need to play with the script's exec frequency until you get the best-fit cron timing. In my company's case there are only up to 3 new articles posted a week, hence a very high frequency of wp-cron.php invocations is useless.
For a blog where new posts occur once a day, a schedule frequency of every 6 to 12 hours should be OK.