Wget and User-Agent Header
As you may already know, Wget is a popular (particularly in the Unix world) command-line downloader and Web crawler. You can read more about Wget in one of my earlier posts on the subject. One issue with Wget is that some sites block it from accessing their content. This is usually done by disallowing Wget in the site’s robots.txt and by configuring the Web server to reject requests whose user-agent header contains “wget”.
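To make the blocking concrete, here is a minimal sketch, assuming a typical Apache setup rather than any particular site’s configuration, of what the two mechanisms might look like on the server side:

# robots.txt entry asking Wget-based crawlers to stay away
User-agent: wget
Disallow: /

# .htaccess fragment (Apache 2.2-style access control, mod_setenvif) rejecting “wget” user-agents
BrowserMatchNoCase wget block_wget
Order Allow,Deny
Allow from all
Deny from env=block_wget

The robots.txt part is purely advisory; it is the user-agent check that actually turns the request away.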
There are a couple of things you can do to get around these roadblocks. The robots.txt issue is dealt with simply by adding the “-e robots=off” option to your Wget command line (it can go before or after the URL; the example below puts it at the end), as shown below.
wget -m -k "http://www.gnu.org/software/wget/" -e robots=off
A custom user-agent string can be set with Wget’s “-U” option:
wget -m -k -U "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8" "http://www.gnu.org/software/wget/" -e robots=off
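If you want to confirm which user-agent header Wget actually sends, one option (assuming your Wget build was compiled with debug support, which most are) is to dump the request headers with the “--debug” flag and grep for the header:

wget --debug -O /dev/null -U "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8" "http://www.gnu.org/software/wget/" 2>&1 | grep -i "user-agent"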
Let’s say you wanted to use Wget to download a list of URLs, going through a different proxy server and sending a different user-agent header for each URL. You will need a list of proxy servers that looks similar to this:
202.175.3.112:80
119.40.99.2:8080
193.37.152.236:3128
83.2.83.44:8080
151.11.83.170:80
119.70.40.101:8080
208.78.125.18:80
You will need a list of URLs to be downloaded, one URL per line. And you will need a list of real user-agent strings (download one here), looking something like this:
Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/20121223 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090428 Firefox/3.6a1pre
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2a1pre) Gecko/20090405 Firefox/3.6a1pre
The script below will pick a random proxy server from proxy_list.txt for each URL in url_list.txt and will use a random user-agent string from user_agents.txt, hopefully making the target Web server believe that you are hundreds of different users from all over the world.
#!/bin/bash

proxies_total=$(wc -l < proxy_list.txt)
user_agents_total=$(wc -l < user_agents.txt)

while read -r url
do
    # Select a random proxy server from proxy_list.txt by prefixing every
    # line with a random number, sorting numerically and taking the first line
    proxy_server_random=$(while read -r proxy_server
        do
            echo "$(( RANDOM % proxies_total ))^$proxy_server"
        done < proxy_list.txt | sort -n | sed 's/^[0-9]*\^//' | head -1)

    # Point Wget at the chosen proxy (wget expects a proxy URL, hence the http:// prefix)
    export http_proxy="http://$proxy_server_random"

    # Select a random user-agent from user_agents.txt the same way
    user_agent_random=$(while read -r user_agent
        do
            echo "$(( RANDOM % user_agents_total )):$user_agent"
        done < user_agents.txt | sort -n | sed 's/^[0-9]*://' | head -1)

    # Download the URL through the chosen proxy with the chosen user-agent
    echo "Downloading $url"
    echo "Proxy server: $proxy_server_random"
    echo "User agent: $user_agent_random"
    wget -q -e use_proxy=yes -U "$user_agent_random" "$url"
done < url_list.txt
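A quick usage note, with random_wget.sh being an arbitrary file name picked for illustration: save the script next to proxy_list.txt, user_agents.txt and url_list.txt, then run:

chmod +x random_wget.sh
./random_wget.sh

On systems with GNU coreutils, each random pick could also be done with a single “shuf -n 1 proxy_list.txt”, but the version above stays closer to the original prefix-and-sort approach.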
Comments
Is there a way to see a list of all the directories in the current directory I am in?
So say in the directory I am in I decide to do
mkdir test
Is there a command that can show that directory in a list? So I can see the “test” directory along with all the others in a list?
Also, what is the difference between yum and wget?
I set up a virtual server running on Ubuntu Server. I tried apt-get, curl, wget, ftp and scp, and none of them will download any packages. How can I get packages installed? It doesn’t even have vi.
My upload speed is very slow, so downloading a file and then uploading it to my website would take too much time and bandwidth. Is there any way I can download a file from one website directly onto my own website? My web host doesn’t allow me SSH/telnet access, so I can’t use wget. Any other ideas?
I don’t think FXP will work either, unless you can somehow do HTTP to FTP…
I can’t log onto my web server, at least not to get a command prompt via telnet or ssh. How else would you suggest I get onto my web server in a way that lets it download a file from another website?
When was wget (the Linux download tool) written?
The software should measure the web pages downloaded by the browsers and any other content (movies, etc.) that I may download, either using torrents or something like wget…
Essentially, it should display how much data I have downloaded in total…
I have a problem with wget: the links are relative and I want them to be absolute.
I mean I am caching recursively, and I want a link in the page, let’s say ../index.php, to become:
http://www.original_domain/index.php