Wget examples and scripts
Wget is a command-line download tool for Unix and Windows. Wget can download Web pages and files; it can submit form data and follow links; and it can mirror entire Web sites to make local copies. Wget is one of the most useful applications you will ever install on your computer, and it is free.
You can download the latest version of Wget from the developer's home page. Precompiled versions of Wget are available for Windows and for most flavors of Unix. Many Unix operating systems come with wget preinstalled, so type which wget to see if you already have it.
Wget supports a multitude of options and parameters, which can be confusing to people unfamiliar with the tool. You can view the available Wget options by typing wget --help or, on a Unix box, man wget.
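For example, on most Unix systems you can confirm that Wget is installed and browse its options like this:

which wget           # show the path to the wget executable, if it is installed
wget --version       # print the installed Wget version
wget --help | more   # page through the full list of options
man wget             # read the manual page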
Here are a few useful examples of how to use Wget:
1) Download main page of Yahoo.com and save it as yahoo.htm
wget -O yahoo.htm http://www.yahoo.com
2) Use Wget from behind an HTTP firewall (proxy):
Set the proxy in Korn or Bash shells:
export http_proxy=proxy_server.domain:port_number
Set the proxy in C-shell:
setenv http_proxy proxy_server.domain:port_number
Run wget through an anonymous proxy:
wget -Y -O yahoo.htm "http://www.yahoo.com"
Run wget through a proxy that requires authentication:
wget -Y --proxy-user=your_username --proxy-passwd=your_password -O yahoo.htm "http://www.yahoo.com"
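If you always go through the same proxy, you can put these settings in your ~/.wgetrc instead of exporting environment variables in every session. This is only a sketch with placeholder values; the exact key names (for example proxy_passwd versus proxy_password) vary slightly between Wget versions:

# ~/.wgetrc -- example proxy settings (placeholder host, port, and credentials)
use_proxy = on
http_proxy = http://proxy_server.domain:port_number/
proxy_user = your_username
proxy_passwd = your_password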
3) Make a local mirror of the Wget home page that you can browse from your hard drive
Here are the options we will use:
-m to mirror the site
-k to make all links local
-D to stay within the specified domain
--follow-ftp to follow FTP links
-np not to ascend to the parent directory
The following two options deal with Web sites protected against automated download tools such as Wget:
-U to masquerade as a Mozilla browser
-e robots=off to ignore the server's robots.txt directives
wget -U Mozilla -m -k -D gnu.org --follow-ftp -np "http://www.gnu.org/software/wget/" -e robots=off
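If the mirror is meant for offline reading, two further options from the reference in section 6 are worth considering: -p (--page-requisites) to also fetch the images and stylesheets each page needs, and -E (--html-extension) to save pages with an .html extension. A possible variant of the command above, shown only as a sketch:

wget -U Mozilla -m -k -p -E -D gnu.org --follow-ftp -np "http://www.gnu.org/software/wget/" -e robots=off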
4) Download all images from Playboy site
Here are the options we will use:
-r for recursive download
-l 0 for unlimited levels
-t 1 for one download attempt per link
-nd not to create local directories
-A to download only files with the specified extensions
wget -r -l 0 -U Mozilla -t 1 -nd -D playboy.com -A jpg,jpeg,gif,png "http://www.playboy.com" -e robots=off
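A recursive image grab like this can hammer a server. If you want to be gentler (and less likely to get blocked), the wait and rate-limiting options from section 6 can be added; the values below are arbitrary examples:

wget -r -l 0 -U Mozilla -t 1 -nd -D playboy.com -A jpg,jpeg,gif,png -w 2 --random-wait --limit-rate=50k "http://www.playboy.com" -e robots=off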
5) Web image collector
The following Korn-shell script reads from a list of URLs and downloads all images found anywhere on those sites. The downloaded images are then filtered by size, and all images smaller than a certain threshold are deleted. The remaining images are saved in a folder named after the URL's domain. The url_list.txt file contains one URL per line.
This script was originally written to run under AT&T UWIN on Windows, but it will also work in any native Unix environment that has the Korn shell.
#!/bin/ksh
# WIC.ksh - Web Image Collector
#
# WIC reads from a list of URLs and spiders each site recursively,
# downloading images that match specified criteria (type, size).

#-----------------------------------------------------------------
# ENVIRONMENT CONFIGURATION
#-----------------------------------------------------------------
WORKDIR="C:/Downloads"          # Working directory
OUTPUT="$WORKDIR/output"        # Final output directory
URLS="$WORKDIR/url_list.txt"    # List of URLs
WGET="/usr/bin/wget"            # Wget executable
SIZE="+7k"                      # Minimum image size to keep
TMPDIR1="$WORKDIR/tmp1"         # Temporary directory 1
TMPDIR2="$WORKDIR/tmp2"         # Temporary directory 2
TMPDIR3="$WORKDIR/tmp3"         # Temporary directory 3

if [ ! -d "$WORKDIR" ]
then
    mkdir "$WORKDIR"
    if [ ! -d "$WORKDIR" ]
    then
        echo "Cannot create $WORKDIR directory. Exiting..."
        exit 1
    fi
fi
cd "$WORKDIR"

if [ ! -d "$OUTPUT" ]
then
    mkdir "$OUTPUT"
    if [ ! -d "$OUTPUT" ]
    then
        echo "Cannot create $OUTPUT directory. Exiting..."
        exit 1
    fi
fi

if [ ! -f "$URLS" ]
then
    echo "URL list not found in $WORKDIR. Exiting..."
    exit 1
fi

# Recreate the temporary download directories
for i in 1 2 3
do
    if [ -d "$WORKDIR/tmp$i" ]
    then
        rm -r "$WORKDIR/tmp$i"
        mkdir "$WORKDIR/tmp$i"
    else
        mkdir "$WORKDIR/tmp$i"
    fi
done

if [ ! -f "$WGET" ]
then
    echo "$WGET executable not found. Exiting..."
    exit 1
fi

#-----------------------------------------------------------------
# DOWNLOAD IMAGES
#-----------------------------------------------------------------
cat "$URLS" | while read URL
do
    echo "Processing $URL"
    # Extract the domain name (third /-separated field of the URL)
    DOMAIN=$(echo "$URL" | awk -F'/' '{print $3}')
    if [ ! -d "$OUTPUT/$DOMAIN" ]
    then
        cd "$TMPDIR1"
        mkdir "$OUTPUT/$DOMAIN"
        $WGET --http-user=your_username --http-passwd=your_password -r -l 0 -U Mozilla -t 1 -nd -A jpg,jpeg,gif,png,pdf "$URL" -e robots=off
        # Keep only files of at least $SIZE; move them to the output folder
        find "$TMPDIR1" -type f -size "$SIZE" -exec mv {} "$OUTPUT/$DOMAIN" \;
        cd "$WORKDIR"
    else
        echo "    $URL already processed. Skipping..."
    fi
    # Clean out the temporary directories before the next URL
    for i in 1 2 3
    do
        if [ -d "$WORKDIR/tmp$i" ]
        then
            rm -r "$WORKDIR/tmp$i"
            mkdir "$WORKDIR/tmp$i"
        else
            mkdir "$WORKDIR/tmp$i"
        fi
    done
done

#-----------------------------------------------------------------
# Remove empty download directories
#-----------------------------------------------------------------
cd "$OUTPUT"
find . -type d | fgrep "./" | while read DIR
do
    if [ `ls -A "$DIR" | wc -l | awk '{print $1}'` -eq 0 ]
    then
        rmdir "$DIR"
    fi
done
cd "$WORKDIR"
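To try the script, create the url_list.txt file in the working directory (one URL per line) and run the script under ksh. A minimal example, assuming the C:/Downloads paths configured at the top of the script; on a native Unix system you would adjust WORKDIR first:

# create the URL list (placeholder URLs, one per line)
cat > C:/Downloads/url_list.txt <<EOF
http://www.gnu.org/software/wget/
http://www.yahoo.com
EOF

# run the collector; kept images end up under C:/Downloads/output/<domain>
ksh WIC.ksh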
6) Wget options
GNU Wget 1.8.1+cvs, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version           display the version of Wget and exit.
  -h,  --help              print this help.
  -b,  --background        go to background after startup.
  -e,  --execute=COMMAND   execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE     log messages to FILE.
  -a,  --append-output=FILE   append messages to FILE.
  -d,  --debug                print debug output.
  -q,  --quiet                quiet (no output).
  -v,  --verbose              be verbose (this is the default).
  -nv, --non-verbose          turn off verboseness, without being quiet.
  -i,  --input-file=FILE      download URLs found in FILE.
  -F,  --force-html           treat input file as HTML.
  -B,  --base=URL             prepends URL to relative links in -F -i file.
       --sslcertfile=FILE     optional client certificate.
       --sslcertkey=KEYFILE   optional keyfile for this certificate.
       --egd-file=FILE        file name of the EGD socket.

Download:
       --bind-address=ADDRESS   bind to ADDRESS (hostname or IP) on local host.
  -t,  --tries=NUMBER           set number of retries to NUMBER (0 unlimits).
  -O   --output-document=FILE   write documents to FILE.
  -nc, --no-clobber             don't clobber existing files or use .# suffixes.
  -c,  --continue               resume getting a partially-downloaded file.
       --progress=TYPE          select progress gauge type.
  -N,  --timestamping           don't re-retrieve files unless newer than local.
  -S,  --server-response        print server response.
       --spider                 don't download anything.
  -T,  --timeout=SECONDS        set the read timeout to SECONDS.
  -w,  --wait=SECONDS           wait SECONDS between retrievals.
       --waitretry=SECONDS      wait 1...SECONDS between retries of a retrieval.
       --random-wait            wait from 0...2*WAIT secs between retrievals.
  -Y,  --proxy=on/off           turn proxy on or off.
  -Q,  --quota=NUMBER           set retrieval quota to NUMBER.
       --limit-rate=RATE        limit download rate to RATE.

Directories:
  -nd  --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER        set http user to USER.
       --http-passwd=PASS      set http password to PASS.
  -C,  --cache=on/off          (dis)allow server-cached data (normally allowed).
  -E,  --html-extension        save all text/html documents with .html extension.
       --ignore-length         ignore `Content-Length' header field.
       --header=STRING         insert STRING among the headers.
       --proxy-user=USER       set USER as proxy username.
       --proxy-passwd=PASS     set PASS as proxy password.
       --referer=URL           include `Referer: URL' header in HTTP request.
  -s,  --save-headers          save the HTTP headers to file.
  -U,  --user-agent=AGENT      identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive    disable HTTP keep-alive (persistent connections).
       --cookies=off           don't use cookies.
       --load-cookies=FILE     load cookies from FILE before session.
       --save-cookies=FILE     save cookies to FILE after session.

FTP options:
  -nr, --dont-remove-listing   don't remove `.listing' files.
  -g,  --glob=on/off           turn file name globbing on or off.
       --passive-ftp           use the "passive" transfer mode.
       --retr-symlinks         when recursing, get linked-to files (not dirs).

Recursive retrieval:
  -r,  --recursive          recursive web-suck -- use with care!
  -l,  --level=NUMBER       maximum recursion depth (inf or 0 for infinite).
       --delete-after       delete files locally after downloading them.
  -k,  --convert-links      convert non-relative links to relative.
  -K,  --backup-converted   before converting file X, back up as X.orig.
  -m,  --mirror             shortcut option equivalent to -r -N -l inf -nr.
  -p,  --page-requisites    get all images, etc. needed to display HTML page.

Recursive accept/reject:
  -A,  --accept=LIST                comma-separated list of accepted extensions.
  -R,  --reject=LIST                comma-separated list of rejected extensions.
  -D,  --domains=LIST               comma-separated list of accepted domains.
       --exclude-domains=LIST       comma-separated list of rejected domains.
       --follow-ftp                 follow FTP links from HTML documents.
       --follow-tags=LIST           comma-separated list of followed HTML tags.
  -G,  --ignore-tags=LIST           comma-separated list of ignored HTML tags.
  -H,  --span-hosts                 go to foreign hosts when recursive.
  -L,  --relative                   follow relative links only.
  -I,  --include-directories=LIST   list of allowed directories.
  -X,  --exclude-directories=LIST   list of excluded directories.
  -np, --no-parent                  don't ascend to the parent directory.
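To tie the reference above back to practice, here is one more sketch combining a few of the listed options: -i to read URLs from a file, -c to resume interrupted downloads, -t to limit retries, -o to log to a file, and --spider to check a URL without downloading anything. The file names are placeholders:

wget -c -t 3 -o download.log -i url_list.txt
wget --spider -o check.log "http://www.gnu.org/software/wget/"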