How to make a mirror of a website on GNU / Linux with wget
Everyone who has used Linux or BSD is probably familiar with wget, or has used this handy console download tool at least a thousand times. Not so many desktop GNU / Linux users, like Ubuntu and Fedora users, have tried using wget for anything more than downloading single files.
Actually, wget is not as popular as it used to be in the earlier Linux days. I've noticed a tendency for newer Linux users to prefer curl (I don't know why).
With all that said, I'm sure there are plenty of Linux users curious how a website mirror can be made with wget. This article will briefly suggest a few ways to do website mirroring on Linux / BSD, as wget is available on both of these free operating systems.
1. The simplest exact mirror copy of a website
The most basic use of wget's mirroring capability is the -m / --mirror argument:
# wget -m http://website-to-mirror.com/sub-directory/
Creating a mirror like this is not very good practice, as the links of the mirrored pages will still point to the external URLs. In other words, the link URLs will not point to your local copy, so if you're offline and try to browse random links of the mirrored pages, many of them will fail to open because you have no internet connection.
2. Mirroring with links rewritten to point to the local copy, and a delay between page downloads
Making a mirror with wget can put a heavy load on the remote server, as it fetches the files as quickly as the bandwidth allows. On busy servers, rapid downloads with wget can significantly degrade the server's response time; on some highly loaded servers it can even cause the server to hang completely.
Hence, mirroring pages with wget without explicitly setting a delay between each page download could be treated by the remote server as a kind of DoS (denial of service) attack. Some site administrators have even set up firewall rules, or web server modules such as Apache mod_security, which filter out IPs making too-frequent HTTP GET / POST requests to the web server.
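To illustrate, here is a hypothetical iptables rate limit of the kind an administrator might deploy (the list name and thresholds are made up for the example; the pattern follows the standard iptables recent-module idiom):
# record each new HTTP connection, then drop clients opening
# more than 20 new connections within 60 seconds
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set --name HTTP
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --update --seconds 60 --hitcount 20 --name HTTP -j DROP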
To make wget wait 10 seconds between each mirrored page download, use:
# wget -mk -w 10 -np --random-wait http://website-to-mirror.com/sub-directory/
Here -mk stands for -m / --mirror combined with -k, the shortcut for --convert-links (make links point to the local copy), and -np / --no-parent keeps wget from ascending into the parent directory. --random-wait tells wget to randomize the pause between page download requests around the 10-second -w value, so the requests don't arrive at perfectly regular intervals.
3. Mirror / retrieve a website sub-directory ignoring robots.txt "mirror restrictions"
Some websites have a robots.txt which restricts content download by clients like wget and curl, or even prohibits crawlers from downloading their website pages completely.
/robots.txt restrictions are not a problem, as wget has an option to disable robots.txt checking when downloading. Getting around the robots.txt restrictions is possible through wget's -e robots=off option.
For instance, if you want to make a local mirror copy of a whole sub-directory with all links converted, with a delay of 10 seconds between each consecutive page request, and without reading the robots.txt allow / forbid rules at all:
# wget -mk -w 10 -np -e robots=off --random-wait http://website-to-mirror.com/sub-directory/
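Before bypassing robots.txt it is worth seeing what the site actually restricts; a quick check using the same placeholder domain:
# print the site's robots.txt rules to the terminal
wget -qO- http://website-to-mirror.com/robots.txt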
4. Mirror a website which prohibits download managers like FlashGet, GetRight, Go!Zilla etc.
Sometimes when you try to use wget to make a mirror copy of an entire site sub-directory, or of the root site domain, you get an error similar to:
Sorry, but the download manager you are using to view this site is not supported. We do not support use of such download managers as FlashGet, Go!Zilla, or GetRight.
This message is produced by the site's dynamic language (PHP / ASP / JSP etc.), as the website code is written to check the User-Agent header sent by the client. The default User-Agent wget sends to the remote web server looks like:
Wget/1.11.4
As this is not a common desktop browser user agent, many webmasters configure their websites to only accept the well-known, established desktop browser user agents sent by client browsers.
Here are a few typical user agents which identify a desktop browser:
- Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814
Firefox/6.0
- Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101
Firefox/6.0
- Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI;
rv:1.9b4) Gecko/2012010317 Firefox/10.0a4
- Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.2a1pre)
Gecko/20110324 Firefox/4.2a1pre
etc. etc.
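To check which User-Agent string your own wget build actually sends (and to test a faked one), you can query a user-agent echo service; here assuming the public httpbin.org test service is reachable:
# show the User-Agent the server sees, first plain, then faked
wget -qO- http://httpbin.org/user-agent
wget -qO- --user-agent="Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0" http://httpbin.org/user-agent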
If you're trying to mirror a website which has implemented this kind of restriction around a "valid" user agent, wget has the -U option, enabling you to fake the user agent. If you get the Sorry, but the download manager you are using to view this site is not supported message, fake / change wget's User-Agent with a command like:
# wget -mk -w 10 -np -e robots=off \
--random-wait \
--referer="http://www.google.com" \
--user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
--header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
--header="Accept-Language: en-us,en;q=0.5" \
--header="Accept-Encoding: gzip,deflate" \
--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
--header="Keep-Alive: 300" \
http://website-to-mirror.com/sub-directory/
For the sake of some wget anonymity, to make wget permanently hide its user agent and pretend to be a Mozilla Firefox running on MS Windows XP, set the user agent in the .wgetrc file in your home directory.
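A minimal ~/.wgetrc sketch (the Firefox string is just the example user agent used above; substitute whichever browser string you prefer, and the wait settings are optional):
# ~/.wgetrc - make wget always identify itself as Firefox on Windows XP
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
# politeness settings matching the earlier examples
wait = 10
random_wait = on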
5. Make a complete mirror of a website under a domain name
To retrieve a complete working copy of a site with wget, a good way is:
# wget -rkpNl5 -w 10 --random-wait www.website-to-mirror.com
Where the arguments mean:
-r - Retrieve recursively
-k - Convert the links in documents to make them suitable for
local viewing
-p - Download everything (inline images, sounds and referenced
stylesheets etc.)
-N - Turn on time-stamping
-l5 - Specify recursion maximum depth level of 5
6. Make a static mirror of a dynamic site, converting CGI, ASP, PHP etc. pages to HTML for offline browsing
Often a website's pages end in .php / .asp / .cgi ... extensions. An example of what I mean is the URL http://php.net/manual/en/tutorial.php. The page is tutorial.php, so once mirrored with wget the local copy will also end in .php, and will therefore not be suitable for local browsing, as the local browser does not know how to interpret the .php extension.
Therefore, to copy a website with non-HTML extensions and make it browsable offline as HTML, there is the --html-extension option (renamed --adjust-extension in newer wget releases), e.g.:
# wget -mk -w 10 -np -e robots=off \
--random-wait --html-extension http://www.website-to-mirror.com
Note that the -k in -mk is already the shortcut for --convert-links, so the rewritten links will point to the renamed .html files as well.
A good practice in mirror making is to set a download rate limit. Setting such a rate is good for both sides (the local host doing the downloading and the remote server). A download limit is also useful when mirroring websites consisting of many enormous files (documentary movies, music etc.).
To set a download limit, add the --limit-rate= option. Passing --limit-rate=200K to wget would limit the download speed to 200KB/s.
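For example, combining the rate limit with the mirroring options from above (same placeholder domain):
# mirror with the bandwidth capped at 200KB/s
wget -mk -w 10 --random-wait --limit-rate=200K http://website-to-mirror.com/sub-directory/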
Another useful thing for verifying that wget has made an accurate mirror is wget's logging. To use it, pass -o ./my_mirror.log to wget.
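A sketch of how the log can be used afterwards, assuming the mirror command from above (the grep pattern is just one simple way to spot failures):
# mirror the site while writing a full session log, then scan it
wget -mk -w 10 -o ./my_mirror.log http://website-to-mirror.com/sub-directory/
grep -iE 'error|failed' ./my_mirror.log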