Mirroring web site content ignoring the
robots.txt prohibition rules with wget on Linux
I wanted to mirror the content of a website which included a robots.txt file with Disallow rules for specific directories, e.g. it contained something like:
User-agent: *
Disallow: /privatedir/
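By the way, before mirroring you can quickly check what exactly a site's robots.txt forbids by dumping it to the terminal; a simple way to do that (www.domain.com being just the placeholder domain used in this example) is:
freebsd# wget -qO- http://www.domain.com/robots.txt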
Since there was a restriction on automated downloads of /privatedir/, I needed to get around it using some command line downloader like wget. After a quick look online I found the wget FAQ, which included a good description of how to ignore the robots rules in robots.txt.
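As far as I remember from the wget documentation, besides passing the option on each command line, the robots behaviour can also be switched off permanently by adding a line to your ~/.wgetrc (or the system-wide wgetrc), roughly like so:
# ~/.wgetrc - make wget always ignore the robots.txt convention
robots = off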
Furthermore I consulted wget's manual, because I wanted to mirror only a part of the whole website (only the data of a certain directory). Finally I ended up with the following wget command, which got me around the robots.txt Disallow restrictions:
freebsd# wget -e robots=off --wait 3 --mirror --level 1 --convert-links \
http://www.domain.com/privatedir/index.html
Issuing the above command mirrored the whole /privatedir/ without any restraints. Here is what the --convert-links option does, as described in the wget manual:
'--convert-links' - After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
Also, as you can see from the above command line, I've used "--wait 3" (a 3 second pause between retrievals) because I wanted to be sure that some mod_rewrite regular expression rules on the server wouldn't cut my access to the /privatedir/ directory because of rapid file fetching.
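If the web server is even more picky about rapid requests, wget also has a --random-wait option which, if I'm not mistaken, randomizes the pause around the value given with --wait so the fetches look less machine-like; a sketch of the same mirror command with it added:
freebsd# wget -e robots=off --wait 3 --random-wait --mirror --level 1 --convert-links \
http://www.domain.com/privatedir/index.html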
The ignoring of robots.txt itself is done via the -e robots=off wget parameter.
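One more thing worth mentioning, since my goal was to mirror only the data of a certain directory: adding wget's --no-parent (-np) option should make sure wget never climbs above /privatedir/ while recursing; something along these lines should do the trick:
freebsd# wget -e robots=off --wait 3 --mirror --level 1 --no-parent --convert-links \
http://www.domain.com/privatedir/index.html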