Mirroring web site content ignoring the
robots.txt prohibition rules with wget on Linux
I wanted to mirror the content of a website which included a robots.txt file with Disallow rules for specific directories, e.g. it contained something like:
User-agent: *
Disallow: /privatedir/
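By the way, before mirroring you can quickly check what exactly a site's robots.txt forbids by dumping it to the terminal; a simple way to do that (www.domain.com being just the placeholder domain used in this example) is:
freebsd# wget -qO- http://www.domain.com/robots.txt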
Since there was a restriction on automated downloads of /privatedir/, I needed to get around it using some command line downloader like wget. After a quick look online I found the wget FAQ, which included a good description of how to ignore the robots rules in robots.txt.
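As far as I remember from the wget documentation, besides passing the option on each command line, the robots behaviour can also be switched off permanently by adding a line to your ~/.wgetrc (or the system-wide wgetrc), roughly like so:
# ~/.wgetrc - make wget always ignore the robots.txt convention
robots = off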
Furthermore I consulted wget's manual, because I wanted to mirror only a part of the whole website (only the data of a certain directory). Finally I ended up with the following wget command, which got me around the robots.txt Disallow restrictions:
freebsd# wget -e robots=off --wait 3 --mirror --level 1 --convert-links \
http://www.domain.com/privatedir/index.html
Issuing the above command mirrored the whole /privatedir/ without any restraints. Here is what the --convert-links option does, as described in the wget manual:
'--convert-links' - After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
Also, as you can see from the above command line, I've used "--wait 3" (a 3 second pause between retrievals) because I wanted to be sure that some mod_rewrite regular expression rules on the server wouldn't cut my access to the /privatedir/ directory because of rapid file fetching.
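If the web server is even more picky about rapid requests, wget also has a --random-wait option which, if I'm not mistaken, randomizes the pause around the value given with --wait so the fetches look less machine-like; a sketch of the same mirror command with it added:
freebsd# wget -e robots=off --wait 3 --random-wait --mirror --level 1 --convert-links \
http://www.domain.com/privatedir/index.html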
The ignoring of robots.txt itself is done via the -e robots=off wget parameter.
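One more thing worth mentioning, since my goal was to mirror only the data of a certain directory: adding wget's --no-parent (-np) option should make sure wget never climbs above /privatedir/ while recursing; something along these lines should do the trick:
freebsd# wget -e robots=off --wait 3 --mirror --level 1 --no-parent --convert-links \
http://www.domain.com/privatedir/index.html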