Saturday, January 16, 2010

Wget Trick to Download from Restrictive Sites

SkyHi @ Saturday, January 16, 2010

Before: wget 403 Forbidden
After the trick: wget bypassing restrictions

I am often logged in to my servers via SSH and need to download a file, such as a WordPress plugin. I've noticed that many sites now block robots like wget from accessing their files, most of the time with .htaccess rules. A permanent workaround is to have wget mimic a normal browser.


Update

# Call wget with browser-like request headers; extra arguments pass through.
# Note: with Accept-Encoding set, a server may reply with gzip data that older
# wget versions save without decompressing. Drop that header if this happens.
wgets()
{
wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
--header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
--header="Accept-Language: en-us,en;q=0.5" \
--header="Accept-Encoding: gzip,deflate" \
--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
--header="Keep-Alive: 300" "$@"
}
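
To check which options a wrapper like this actually hands to wget, one option is a dry run that never touches the network. The sketch below is only an illustration: the wrapper is shortened to two of the options above, and `show_wgets_args` is a hypothetical helper name, not part of wget. Inside a subshell it shadows wget with a function that prints its arguments.

```shell
# Browser-like wrapper, shortened from the function above for brevity.
wgets() {
  wget --referer="http://www.google.com" \
       --header="Accept-Language: en-us,en;q=0.5" "$@"
}

# Hypothetical dry-run helper: the body runs in a subshell, where wget is
# replaced by a function that prints each argument on its own line.
# Nothing is downloaded and the real wget is untouched afterwards.
show_wgets_args() (
  wget() { printf '%s\n' "$@"; }
  wgets "$@"
)

show_wgets_args -nv http://www.askapache.com/sitemap.xml
```

The output lists the wrapper's own options first, followed by whatever was passed on the command line, which makes it easy to spot a missing or misquoted header.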

Using alias

Add this to your .bash_profile or other shell startup script, or just type it at the prompt. Then run wget from the command line as usual, e.g. wget -dnv http://www.askapache.com/sitemap.xml.

alias wget='wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300"'
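
Because the alias reuses the name wget, it helps to know how to inspect it, skip it for a single call, and remove it. A minimal sketch, with the alias shortened to one option for illustration:

```shell
# Shortened alias for illustration; the full header list is in the alias above.
alias wget='wget --referer="http://www.google.com"'

# Show what the alias currently expands to.
alias wget

# Run the real wget once, ignoring the alias (either form works):
#   \wget http://www.askapache.com/sitemap.xml
#   command wget http://www.askapache.com/sitemap.xml

# Remove the alias for the rest of the session.
unalias wget
```

Keep in mind that bash normally expands aliases only in interactive shells, so scripts that call wget are unlikely to pick up these headers from an alias.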

Using custom .wgetrc

Alternatively, you could create or modify your $HOME/.wgetrc file like this, or download the sample file and rename it to .wgetrc. Then run wget from the command line as usual, e.g. wget -dnv http://www.askapache.com/sitemap.xml.

###
### Sample Wget initialization file .wgetrc by http://www.askapache.com
###
##
## Local settings (for a user to set in his $HOME/.wgetrc).  It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##

header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
header = Accept-Encoding: gzip,deflate
header = Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
header = Keep-Alive: 300
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
referer = http://www.google.com
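
If you would rather not change the behavior of every wget call, one option is to keep these settings in a separate file and point wget at it only when needed, using the WGETRC environment variable. The sketch below is an assumption-laden illustration: the file location and the variable name `rc` are arbitrary choices, and only a subset of the settings above is written.

```shell
# Hypothetical location for the browser-like settings; any path works.
rc="${TMPDIR:-/tmp}/wgetrc-browser"

# Write a reduced version of the sample .wgetrc shown above.
cat > "$rc" <<'EOF'
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
referer = http://www.google.com
EOF

# Use the file for a single command only:
#   WGETRC="$rc" wget -nv http://www.askapache.com/sitemap.xml
```

When WGETRC is set, wget reads that file in place of $HOME/.wgetrc, so plain wget calls keep their default headers.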

From the command line

wget --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
--header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
--header="Accept-Language: en-us,en;q=0.5" \
--header="Accept-Encoding: gzip,deflate" \
--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
--header="Keep-Alive: 300" -dnv http://www.askapache.com/sitemap.xml

REFERENCE
http://www.askapache.com/dreamhost/wget-header-trick.html