Wget Examples
This is a follow-up to my previous wget notes (1, 2, 3, 4). From time to time I find myself googling wget syntax, even though I think I've used every option of this excellent utility over the years. Perhaps my memory is not what it used to be, but I'm probably the most frequent visitor to my own Web site… Anyway, here's the grand list of the more useful wget snippets.
Download tar.gz and uncompress with a single command:
wget -q "${url}/archive.tar.gz" -O - | tar xz
Download tar.bz2 and uncompress with a single command:
wget -q "${url}/archive.tar.bz2" -O - | tar xj
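Both pipelines unpack into the current directory; tar's -C flag changes the destination (${dest_dir} is a placeholder for wherever you want the files):
wget -q "${url}/archive.tar.gz" -O - | tar xz -C "${dest_dir}"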
Download in the background, limit bandwidth to 200 KB/s, do not ascend to the parent URL, mirror the site fetching only files newer than the local copies, do not create a directory hierarchy, accept only htm, html, php, and pdf files, and wait 5 seconds between requests:
wget -b --limit-rate=200k -np -N -m -nd --accept=htm,html,php,pdf --wait=5 "${url}"
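With -b, wget detaches immediately and writes its progress to wget-log in the working directory, so the download can be watched with:
tail -f wget-log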
Download recursively, span multiple hosts, convert links to local, limit recursion level to 4, fake “mozilla” user agent, ignore “robots” directives:
wget -r -H --convert-links --level=4 --user-agent=Mozilla "${url}" -e robots=off
Generate a list of broken links:
wget --spider -o broken_links.log --wait 2 -r -p "${url}" -e robots=off
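When spidering recursively, wget flags dead URLs in the log with a "broken link" message, so they can be pulled out with a quick grep (assuming the English message text of recent wget versions; -B2 keeps the preceding lines that show the URL):
grep -B2 'broken link!' broken_links.log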
Download new PDFs from a list of URLs:
wget -r --level=1 -H --timeout=2 -nd -N -np --accept=pdf -e robots=off -i urls.txt
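Here urls.txt is a plain list, one URL per line; a hypothetical example:
http://example.com/papers/
http://example.org/reports/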
Save and use authentication cookie:
wget -O ~/.domain_cookie_tmp "https://domain.com/login.cgi?login=${username}&password=${password}"
grep "^cookie" ~/.domain_cookie_tmp | awk -F'=' '{print $2}' > ~/.domain_cookie wget -c --no-cookies --header="Cookie: enc=`cat ~/.domain_cookie`" -i "${url_file}" -nc
Use wget with an anonymous proxy (wget picks up the http_proxy environment variable automatically):
export http_proxy=proxy_server:port
wget -O /tmp/yahoo.htm "http://www.yahoo.com"
Use wget with an authorized proxy:
export http_proxy=proxy_server:port
wget --proxy-user=${username} --proxy-password=${password} \
  -O /tmp/yahoo.htm "http://www.yahoo.com"
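Alternatively, the proxy settings can live in ~/.wgetrc, which keeps the credentials out of the shell history (a sketch; proxy_server, port, username, and password are placeholders):
http_proxy = http://proxy_server:port/
proxy_user = username
proxy_password = password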
Make a local mirror of a Web site, including FTP links; limit the rate to 50 KB/s; wait 5 seconds between requests with a random spread; ignore the robots directive:
wget -U Mozilla -m -k -D ${domain} --follow-ftp \
  --limit-rate=50k --wait=5 --random-wait -np "${url}" -e robots=off
Download images from a Web site:
wget -r -l 0 -U Mozilla -t 1 -nd -D ${domain} \
  -A jpg,jpeg,gif,png "${url}" -e robots=off
Extract a list of HTTP(S) and FTP(S) links from a single URL:
wget -qO- "${url}" | grep -oE "(https?|ftps?)://[^<>\"' ]+" | sort -u
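The extracted list can be piped straight back into wget, which reads URLs from standard input when -i is given "-" (a sketch; add an --accept filter to taste):
wget -qO- "${url}" | grep -oE "(https?|ftps?)://[^<>\"' ]+" | sort -u | wget -nc -i -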
Mirror a subfolder of a site:
wget -mk -w 20 -np "${url}"
Update only changed files:
wget -mk -w 20 -N "${url}"
Mirror site with random delay between requests:
wget -w 20 --random-wait -mk "${url}"
Download a list of URLs from a file:
wget -i "${url_file}"
Resume interrupted file download:
wget -c "${file_url}"
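For flaky servers it helps to combine -c with unlimited retries; these are standard wget flags:
wget -c -t 0 --retry-connrefused "${file_url}"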
Download files in the background:
wget -c "${url}"
Download the first two levels of pages from a site:
wget -r -l2 "${url}"
Make a static copy of a dynamic Web site two levels deep:
wget -P /var/www/html/ -mpck -l2 --user-agent="Mozilla" -e robots=off -E "${url}"
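To keep the static copy fresh, the same command can run from cron; a sketch assuming a nightly refresh at 03:00 (the path and URL are placeholders):
0 3 * * * wget -q -P /var/www/html/ -mpck -l2 --user-agent="Mozilla" -e robots=off -E "http://example.com/"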