Wget Google image collector
Google Images is an extremely useful tool for webmasters, designers, editors, and just about anybody else who’s in a hurry to find just the right photo or clipart. However, this Google tool has a couple of annoying issues. First, the image index is only refreshed once in a while, so any set of search results will contain a bunch of dead links. Second, many websites block hot-linking of images, so you cannot open just the image: you have to open the entire web page where the image is located.
The idea behind the following bash shell script is to address these two small problems, as well as to save you some time downloading images. The script reads from a list of keywords formatted exactly as you would format your Google search query. For example, to find a bunch of photos of Red Square in Moscow you would type something like: +"red square" +moscow. This searches for "red square" as a phrase plus the word "moscow".
The search query is passed to Google as a URL along with the desired image size (icon, small, medium, large, xlarge, xxlarge). If you are looking only for the largest photos, you would specify the image size as "xxlarge". However, if you want everything from "large" on up, you would use the pipe character as a logical OR: large|xlarge|xxlarge. See the script below for an example.
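To make this concrete, here is roughly how the script assembles the search URL (a simplified excerpt of the full script below):

# Build the Google Images query URL; KEYWORD comes from keywords.txt,
# SIZE restricts results to large images and up.
KEYWORD='+"red square" +moscow'
SIZE="large|xlarge|xxlarge"
URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}"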
The next step for the script is to count the number of search hits and download all the Google pages listing the search results. A results page may say something like: "found 250 matches, displaying images 1-20". The script then formats the appropriate URL and sends it to Google to get search results 21-40, 41-60, etc. until it reaches 250.
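In script form, the pagination boils down to a loop like this (a simplified excerpt of the full script below, reusing the URL variable from the previous snippet; Google serves 20 hits per page, so page i starts at i*20):

# Fetch result pages 2, 3, ... until the last page is reached
i=1
while [ "$i" -lt "$results" ]
do
    (( START = i * 20 ))
    wget -U Mozilla -e robots=off -O "results_${i}.txt" "${URL}&start=${START}"
    (( i = i + 1 ))
done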
All the search results are parsed to extract only the image URLs linking to the actual photos on the various sites. These URLs are dumped into a file and fed to Wget. The direct-linking issue is no longer a concern, because each image URL is opened directly and the HTTP Referer header is blank. The downloaded images are written to a separate directory for each search query. A minimum-file-size filter is applied to the downloaded photos: all images smaller than a threshold (10KB in the script below) are deleted. Finally, ImageMagick is used to enlarge images that are smaller than a certain size.
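The filter-and-resize step looks roughly like this (simplified from the full script below):

# Drop downloads under 10KB; the '<' flag makes convert enlarge only images
# that are already smaller than 640x640, leaving bigger ones untouched.
if [ "$(wc -c < "photo_${j}.jpg")" -lt 10000 ]
then
    rm "photo_${j}.jpg"
else
    convert -filter Cubic -resize '640x640<' "photo_${j}.jpg" "photo_${j}.jpg"
fi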
You will need to create a file called “keywords.txt” in the same directory as the script. The file should contain one search query per line. Here’s an example:
+"red square" +moscow +"times square" +"new york" +bush +iraq +wmd +"big fat liar"
And here’s the script:
#!/bin/bash
# Collect and resize images from Google Images based on a keyword.

#------------------------------
# VARIABLES
#------------------------------
DATETIME=$(date +'%Y-%m-%d')
HOMEDIR="/WD120GB_01/misc/google"
SIZE="large|xlarge|xxlarge"
#SIZE="icon|small|medium|large|xlarge|xxlarge"

#------------------------------
# CONFIGURATION
#------------------------------
if [ ! -d "$HOMEDIR" ]
then
    echo "Home directory $HOMEDIR not found. Exiting..."
    exit 1
fi

if [ ! -f "${HOMEDIR}/keywords.txt" ]
then
    echo "Keyword list not found. Exiting..."
    exit 1
fi

#------------------------------
# IMAGE SEARCH & DOWNLOAD
#------------------------------
while read KEYWORD
do
    STATUS=1
    i=0

    # Build a filesystem-safe directory name from the search query
    OUTDIR=$(echo "$KEYWORD" | sed 's/ /_/g; s/+/_/g; s/"//g; s/\*//g' | sed "s/'//g" | sed 's/^_//; s/__/_/g')
    OUTDIR="${HOMEDIR}/${OUTDIR}_${DATETIME}"

    if [ ! -d "${OUTDIR}" ]
    then
        mkdir "${OUTDIR}"
    else
        echo "Directory ${OUTDIR} already exists. Skipping..."
        STATUS=0
    fi

    if [ $STATUS -eq 1 ]
    then
        # Fetch the first page of search results
        URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}"
        wget -U Mozilla -e robots=off -O "${OUTDIR}/results_${i}.txt" "$URL"

        # Split the page on href, keep lines carrying an imgurl= parameter, and
        # cut out the direct image URL between "imgurl=" and "&imgrefurl"
        sed 's/href/\n/g' "${OUTDIR}/results_${i}.txt" | grep imgurl | grep imgrefurl | \
            sed 's/imgurl=/@/g; s/&imgrefurl/@/g' | awk -F'@' '{print $2}' > "${OUTDIR}/image_urls.txt"

        # Count the remaining result pages from the pagination links
        results=$(sed 's/border/\n/g' "${OUTDIR}/results_${i}.txt" | fgrep '&start=' | \
            fgrep -v '&start=0' | sort -u | fgrep ' ' | wc -l)

        # Fetch result pages 2, 3, ... (20 hits per page)
        i=1
        while [ "$i" -lt "$results" ]
        do
            (( START = i * 20 ))
            URL="http://images.google.com/images?q=${KEYWORD}&svnum=100&hl=en&lr=&safe=off&sa=G&imgsz=${SIZE}&start=${START}"
            wget -U Mozilla -e robots=off -O "${OUTDIR}/results_${i}.txt" "$URL"
            sed 's/href/\n/g' "${OUTDIR}/results_${i}.txt" | grep imgurl | grep imgrefurl | \
                sed 's/imgurl=/@/g; s/&imgrefurl/@/g' | awk -F'@' '{print $2}' >> "${OUTDIR}/image_urls.txt"
            (( i = i + 1 ))
        done

        find "$OUTDIR" -type f -name "results_*.txt" -exec rm {} \;

        # Keep only unique JPEG URLs
        fgrep '.jpg' "${OUTDIR}/image_urls.txt" | sort -u > /tmp/google_image_collector.tmp
        mv /tmp/google_image_collector.tmp "${OUTDIR}/image_urls.txt"

        if [ -f "${OUTDIR}/image_urls.txt" ]
        then
            clear
            COUNT=$(wc -l < "${OUTDIR}/image_urls.txt")
            echo "Found $COUNT images matching $KEYWORD"

            # Download each image; drop anything under 10KB, upscale the rest
            j=1
            while read LINE
            do
                wget -U Mozilla -e robots=off -nd -t 1 -T 5 -O "${OUTDIR}/photo_${j}.jpg" "$LINE"
                if [ "$(wc -c < "${OUTDIR}/photo_${j}.jpg")" -lt 10000 ]
                then
                    rm "${OUTDIR}/photo_${j}.jpg"
                else
                    convert -filter Cubic -resize '640x640<' "${OUTDIR}/photo_${j}.jpg" "${OUTDIR}/photo_${j}.jpg"
                    (( j = j + 1 ))
                fi
            done < "${OUTDIR}/image_urls.txt"
        fi
    fi
done < "${HOMEDIR}/keywords.txt"
Comments
Good script. I modified it a little to only download the first X images for each row. Thank you for posting this! :)
Can you please provide the modified script that downloads only the first X images for each row?
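One way to do it (an untested sketch, and not necessarily the commenter's actual modification): trim image_urls.txt to the first X entries right after it is deduplicated, before the download loop starts.

# Hypothetical tweak: keep only the first X image URLs per keyword
X=10
head -n "$X" "${OUTDIR}/image_urls.txt" > "${OUTDIR}/image_urls.tmp"
mv "${OUTDIR}/image_urls.tmp" "${OUTDIR}/image_urls.txt"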
Great script :-)
How would you modify it to take only the first good image?
Thanks
Sounds very interesting, but I can't get it to work. Is there any value to change or adjust, except the keywords?
How should I name the script file (what extension?), and how are you running it?
I kind of understand the script, but I can't figure out the proper syntax yet.
Thanks for the help.
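For anyone else wondering: bash doesn't care about the file extension. Save the script under any name (google_images.sh is just an example), adjust HOMEDIR to point at the directory containing keywords.txt, make the script executable, and run it:

# The filename is arbitrary; the .sh extension is only a convention
chmod +x google_images.sh
./google_images.sh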
Thanks man, this works really well. Saved me a lot of time.
Thank you for the great job! It's very helpful in my script for Anki cards.
Works great.
Needs to be updated for the new Google Images URL (easy).
Best script to date.
Many thanks. I've been looking for exactly this kind of script for one of my projects.
I hope I'm allowed to be a bit thick and ask: does this already, or could it ever, capture the original, largest possible image? I've just made the leap to create a little folder from which images are imported into iMovie, and I'm trying to script as much of the process as possible, so that keyword-based image search results can be fetched from the web and imported directly into iMovie.
This & Quartz Composer & Scripting are going to keep me busy for quite a few days!
This script creates directories and files that are locked, and then it terminates because it can't overwrite those directories. This happens even when I sudo chmod 777 the directory, and also when I run sudo ./file.sh. This is frustrating; has anyone experienced the same issue?
Hi all,
I need to download 10,000 Google images in JPG format for research purposes. What is the best way to do this?
Please give clear, well-stated instructions, as I am not familiar with Wget.
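If you already have (or can generate) a list of image URLs, one per line, you don't need the whole script: wget can read a URL list directly. A minimal sketch, assuming the list is saved as urls.txt:

# Download every URL listed in urls.txt (one URL per line)
# -nd: no directory hierarchy; -t 1: one attempt; -T 5: 5-second timeout
wget -U Mozilla -nd -t 1 -T 5 -i urls.txt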
Can you update this script for use with the new Google Images URL? I added &sout=1 to the end of the string and it still doesn't seem to work for me. Any thoughts? Thanks.