Wget Google image collector
Google Images is an extremely useful tool for webmasters, designers, editors, and just about anybody else who’s in a hurry to find just the right photo or clipart. However, this Google tool has a couple of annoying issues. First, the image index is only refreshed once in a while, so any set of search results will contain a bunch of dead links. Second, many websites block hot-linking of images, so you cannot open just the image: you have to open the entire web page where the image is located.
The idea behind the following bash shell script is to address these two small problems, as well as to save you some time downloading images. The script reads from a list of keywords formatted exactly as you would format your Google search query. For example, to find a bunch of photos of Red Square in Moscow you would type something like: +"red square" +moscow. This searches for "red square" as a phrase plus the word "moscow".
The search query is passed to Google as a URL along with the desired image size (icon, small, medium, large, xlarge, xxlarge). If you are looking only for the largest photos, you would specify the image size as "xxlarge". However, if you want everything from "large" on up, you would use the pipe character as a logical OR: large|xlarge|xxlarge. See the script below for an example.
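To make this concrete, here is roughly how the script assembles the search URL (a simplified excerpt of the full script below):

# Build the Google Images query URL; KEYWORD comes from keywords.txt,
# SIZE restricts results to large images and up.
KEYWORD='+"red square" +moscow'
SIZE="large|xlarge|xxlarge"
URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}"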
The next step for the script is to count the number of search hits and download all the Google pages listing the search results. A results page may say something like: "found 250 matches, displaying images 1-20". The script then formats the appropriate URL and sends it to Google to get search results 21-40, 41-60, etc. until it reaches 250.
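In script form, the pagination boils down to a loop like this (a simplified excerpt of the full script below, reusing the URL variable from the previous snippet; Google serves 20 hits per page, so page i starts at i*20):

# Fetch result pages 2, 3, ... until the last page is reached
i=1
while [ "$i" -lt "$results" ]
do
    (( START = i * 20 ))
    wget -U Mozilla -e robots=off -O "results_${i}.txt" "${URL}&start=${START}"
    (( i = i + 1 ))
done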
All the search results are parsed to extract only the image URLs linking to the actual photos on the various sites. These URLs are dumped into a file and fed to Wget. The direct-linking issue is no longer a concern, because each image URL is opened directly and the HTTP Referer header is blank. The downloaded images are written to a separate directory for each search query. A minimum-file-size filter is applied to the downloaded photos: all images smaller than a threshold (10KB in the script below) are deleted. Finally, ImageMagick is used to enlarge images that are smaller than a certain size.
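The filter-and-resize step looks roughly like this (simplified from the full script below):

# Drop downloads under 10KB; the '<' flag makes convert enlarge only images
# that are already smaller than 640x640, leaving bigger ones untouched.
if [ "$(wc -c < "photo_${j}.jpg")" -lt 10000 ]
then
    rm "photo_${j}.jpg"
else
    convert -filter Cubic -resize '640x640<' "photo_${j}.jpg" "photo_${j}.jpg"
fi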
You will need to create a file called “keywords.txt” in the same directory as the script. The file should contain one search query per line. Here’s an example:
+"red square" +moscow +"times square" +"new york" +bush +iraq +wmd +"big fat liar"
And here’s the script:
#!/bin/bash
# Collect and resize images from Google Images based on a keyword.

#------------------------------
# VARIABLES
#------------------------------
DATETIME=$(date +'%Y-%m-%d')
HOMEDIR="/WD120GB_01/misc/google"
SIZE="large|xlarge|xxlarge"
#SIZE="icon|small|medium|large|xlarge|xxlarge"

#------------------------------
# CONFIGURATION
#------------------------------
if [ ! -d "$HOMEDIR" ]
then
    echo "Home directory $HOMEDIR not found. Exiting..."
    exit 1
fi

if [ ! -f "${HOMEDIR}/keywords.txt" ]
then
    echo "Keyword list not found. Exiting..."
    exit 1
fi

#------------------------------
# IMAGE SEARCH & DOWNLOAD
#------------------------------
while read KEYWORD
do
    STATUS=1
    i=0

    # Build a filesystem-safe directory name from the search query
    OUTDIR=$(echo "$KEYWORD" | sed 's/ /_/g; s/+/_/g; s/"//g; s/\*//g' | sed "s/'//g" | sed 's/^_//; s/__/_/g')
    OUTDIR="${HOMEDIR}/${OUTDIR}_${DATETIME}"

    if [ ! -d "${OUTDIR}" ]
    then
        mkdir "${OUTDIR}"
    else
        echo "Directory ${OUTDIR} already exists. Skipping..."
        STATUS=0
    fi

    if [ $STATUS -eq 1 ]
    then
        # Fetch the first page of search results
        URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}"
        wget -U Mozilla -e robots=off -O "${OUTDIR}/results_${i}.txt" "$URL"

        # Split the page on href, keep lines carrying an imgurl= parameter, and
        # cut out the direct image URL between "imgurl=" and "&imgrefurl"
        sed 's/href/\n/g' "${OUTDIR}/results_${i}.txt" | grep imgurl | grep imgrefurl | \
            sed 's/imgurl=/@/g; s/&imgrefurl/@/g' | awk -F'@' '{print $2}' > "${OUTDIR}/image_urls.txt"

        # Count the remaining result pages from the pagination links
        results=$(sed 's/border/\n/g' "${OUTDIR}/results_${i}.txt" | fgrep '&start=' | \
            fgrep -v '&start=0' | sort -u | fgrep ' ' | wc -l)

        # Fetch result pages 2, 3, ... (20 hits per page)
        i=1
        while [ "$i" -lt "$results" ]
        do
            (( START = i * 20 ))
            URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}&start=${START}"
            wget -U Mozilla -e robots=off -O "${OUTDIR}/results_${i}.txt" "$URL"
            sed 's/href/\n/g' "${OUTDIR}/results_${i}.txt" | grep imgurl | grep imgrefurl | \
                sed 's/imgurl=/@/g; s/&imgrefurl/@/g' | awk -F'@' '{print $2}' >> "${OUTDIR}/image_urls.txt"
            (( i = i + 1 ))
        done

        find "$OUTDIR" -type f -name "results_*.txt" -exec rm {} \;

        # Keep only unique JPEG URLs
        fgrep '.jpg' "${OUTDIR}/image_urls.txt" | sort -u > /tmp/google_image_collector.tmp
        mv /tmp/google_image_collector.tmp "${OUTDIR}/image_urls.txt"

        if [ -f "${OUTDIR}/image_urls.txt" ]
        then
            clear
            COUNT=$(wc -l < "${OUTDIR}/image_urls.txt")
            echo "Found $COUNT images matching $KEYWORD"

            # Download each image; drop anything under 10KB, upscale the rest
            j=1
            while read LINE
            do
                wget -U Mozilla -e robots=off -nd -t 1 -T 5 -O "${OUTDIR}/photo_${j}.jpg" "$LINE"
                if [ "$(wc -c < "${OUTDIR}/photo_${j}.jpg")" -lt 10000 ]
                then
                    rm "${OUTDIR}/photo_${j}.jpg"
                else
                    convert -filter Cubic -resize '640x640<' "${OUTDIR}/photo_${j}.jpg" "${OUTDIR}/photo_${j}.jpg"
                    (( j = j + 1 ))
                fi
            done < "${OUTDIR}/image_urls.txt"
        fi
    fi
done < "${HOMEDIR}/keywords.txt"
Comments
Good script. I modified it a little to only download the first X images for each row. Thank you for posting this! :)
Can you please provide the modified script that downloads only the first X images for each row?
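One way to do it (an untested sketch, and not necessarily the commenter's actual modification): trim image_urls.txt to the first X entries right after it is deduplicated, before the download loop starts.

# Hypothetical tweak: keep only the first X image URLs per keyword
X=10
head -n "$X" "${OUTDIR}/image_urls.txt" > "${OUTDIR}/image_urls.tmp"
mv "${OUTDIR}/image_urls.tmp" "${OUTDIR}/image_urls.txt"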
Great script :-)
How would you modify it to take only the first good image?
Thanks
Sounds very interesting, but I can't get it to work. Is there any value to change or adjust, except the keywords?
How should I name the script file (what extension?), and how are you running it?
I kind of understand the script, but I can't figure out the proper syntax yet.
Thanks for the help.
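For anyone else wondering: bash doesn't care about the file extension. Save the script under any name (google_images.sh is just an example), adjust HOMEDIR to point at the directory containing keywords.txt, make the script executable, and run it:

# The filename is arbitrary; the .sh extension is only a convention
chmod +x google_images.sh
./google_images.sh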
Thanks man, this works really well. Saved me a lot of time.
Thank you for the great job! It's very helpful in my script for Anki cards.
Works great.
Needs to be updated for the new Google Images URL (easy).
Best script to date.
Many thanks. I've been looking for exactly this kind of script for one of my projects.
I hope I'm allowed to be a bit thick and ask: does this already, or could it ever, capture the original, largest possible image? I've just made the leap to create a little folder from which images are imported into iMovie, and I'm trying to script as much of the process as possible, so that keyword-based image search results can be fetched from the web and imported directly into iMovie.
This & Quartz Composer & Scripting are going to keep me busy for quite a few days!
This script creates directories and files that are locked, and then it terminates because it can't overwrite those directories. This happens even when I sudo chmod 777 the directory, and also when I run sudo ./file.sh. This is frustrating; has anyone experienced the same issue?
Hi all,
I need to download 10,000 Google images in JPG format for research purposes. What is the best way to do this?
Please give clear, well-stated instructions, as I am not familiar with Wget.
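If you already have (or can generate) a list of image URLs, one per line, you don't need the whole script: wget can read a URL list directly. A minimal sketch, assuming the list is saved as urls.txt:

# Download every URL listed in urls.txt (one URL per line)
# -nd: no directory hierarchy; -t 1: one attempt; -T 5: 5-second timeout
wget -U Mozilla -nd -t 1 -T 5 -i urls.txt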
Can you update this script for use with the new Google Images URL? I added &sout=1 to the end of the string and it still doesn't seem to work for me. Any thoughts? Thanks.