Let’s scrape the web (with Selenium)— Part 2

In my previous article, I explained what a scraper is and built a very basic one (not really a scraper yet). In this article I will share a few tips and tricks for scraping, and we will continue working on our “funny cat memes” scraper.

Are you a human?

reCAPTCHA v3, Image by fossbytes.com

When scraping a website, you should be careful, because the website can detect whether a real human is browsing or a scraper is doing the browsing.

There are some tricks to look like a real human browsing. For example, in the previous article, when we used send_keys , Selenium typed the text instantly. A real user won’t do that. Real users pause between key presses (about 150–180 milliseconds).

Also, the delay between characters shouldn’t always be the same. Think about it. When you type something, the distance between keys differs, so the delay between characters differs too (at least it does for me). So we want to pick a random number between a minimum delay and a maximum delay.

For this we need another library called, you guessed it, random . Inside random there is a function called randrange() . It takes two arguments, a minimum and a maximum, and returns an integer from the minimum up to, but not including, the maximum.
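As a quick illustration of how randrange() behaves, here is a minimal sketch. The names MIN_DELAY_MS and MAX_DELAY_MS are made up for this example and are not part of the scraper. Note that the upper bound is exclusive, so randrange(150, 180) never returns 180.

```python
from random import randrange

# Hypothetical bounds for this example, in milliseconds.
MIN_DELAY_MS = 150
MAX_DELAY_MS = 180

# randrange(start, stop) returns an integer n with start <= n < stop,
# so the upper bound itself is never returned.
samples = [randrange(MIN_DELAY_MS, MAX_DELAY_MS) for _ in range(1000)]

print(min(samples), max(samples))  # both values fall within 150..179
```

If you need fractional delays instead of whole milliseconds, random.uniform() is an alternative worth considering.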

Keep this in mind when building a scraper. Otherwise the website could ban your IP or stop sending you the results you expect.

Another important point: you should wait between the requests you send to the server. This is not an issue right now, because with Selenium we are not sending 10 or more requests per second.
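For scrapers that do send many requests, one possible way to wait between them is sketched below. The function name polite_pause, the 1 to 3 second range, and the example URLs are all assumptions for illustration. The sleep_fn parameter only exists so the demo can run without actually waiting.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep_fn=time.sleep):
    """Sleep for a random time between min_seconds and max_seconds.

    Returns the chosen delay so the caller can log it.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep_fn(delay)
    return delay


# Demo with a stand-in for sleep, so the loop finishes instantly.
recorded = []
for url in ["https://example.com/page1", "https://example.com/page2"]:
    # A real scraper would fetch `url` here before pausing.
    polite_pause(1.0, 3.0, sleep_fn=recorded.append)

print(recorded)  # two random delays, each between 1.0 and 3.0 seconds
```

In a real script you would call polite_pause() with its default sleep_fn so that it really waits.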

Anyway, let’s add some delay between keystrokes. It’s really simple with Python, because it has a time library, and inside it there is a function called sleep .

Let’s import these two libraries.

from time import sleep
from random import randrange

Now let’s create a function that takes some arguments and types the keys with delays. I named this function send_keys_delayed .

def send_keys_delayed(element, text):
    for c in text:
        # Pick a random delay in milliseconds and convert it to seconds
        delay = randrange(150, 180) / 1000
        element.send_keys(c)
        sleep(delay)

This function types the text into element one character at a time, with a delay of 150 to 180 milliseconds after each character. You may ask, “Why do we divide our delay by 1000?” It’s because the sleep function works with seconds, and there are 1000 milliseconds in a second.
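If you want to check the typing logic without opening a browser, here is one possible sketch. FakeElement is a hypothetical stand-in for a Selenium element. The extra parameters (min_ms, max_ms, sleep_fn) and the name send_keys_delayed_ms are additions for this example, not part of the function in the article.

```python
from random import randrange
from time import sleep


class FakeElement:
    """Stand-in for a Selenium element; it only records what was typed."""

    def __init__(self):
        self.typed = []

    def send_keys(self, keys):
        self.typed.append(keys)


def send_keys_delayed_ms(element, text, min_ms=150, max_ms=180, sleep_fn=sleep):
    """Type text one character at a time, pausing after each character."""
    for c in text:
        delay = randrange(min_ms, max_ms) / 1000  # milliseconds -> seconds
        element.send_keys(c)
        sleep_fn(delay)


box = FakeElement()
delays = []
# Record the delays instead of sleeping, so the demo runs instantly.
send_keys_delayed_ms(box, "cat", sleep_fn=delays.append)

print("".join(box.typed))  # cat
print(delays)              # three values, each from 0.150 up to 0.179
```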

Now let’s use our new function. After finding search_box , call it to send the keys.

send_keys_delayed(search_box, "funny cat memes")

If you run this script again, you will see that there is a delay between each key stroke.

Downloading images

Our goal is to save funny cat memes, because we are cat lovers, right?

The first thing we need to do is change our URL to www.google.com/imghp , because we want to scrape images.

driver.get("https://www.google.com/imghp")

When we run this program again, we will see funny cat memes in Google Images.

Now let’s find the sources of these images. If you open Developer Tools again and inspect an image, you will see two things that are the same for every image. First, the src attribute of each image is Base64 encoded. Second, all images have a class called rg_i . So our goal is to get each image’s Base64 string and decode it.

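Selenium has many great methods for finding the element or elements you want. In this case we want multiple elements by class name. The method for that is find_elements_by_class_name() . (Newer Selenium 4 releases removed the find_elements_by_* helpers in favor of find_elements(By.CLASS_NAME, "rg_i") , so use that form if the call below is not available in your version.)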

images = driver.find_elements_by_class_name("rg_i")

This line returns a list of the elements that have the rg_i class. This is exactly what we want. Next we need the src attribute of these images. In Selenium, every element has a method called get_attribute . It returns the value of the attribute you ask for.

In our case, we want the src attribute.

sources = [image.get_attribute("src") for image in images]

If you don’t know this syntax, it’s called a list comprehension. It provides a much cleaner syntax than building the list with a loop.
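To make the comparison concrete, here is a small sketch that builds the same list both ways. FakeImage is a hypothetical stand-in for a Selenium element so the example runs without a browser, and the sample src value is made up.

```python
class FakeImage:
    """Hypothetical stand-in for a Selenium element with attributes."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        # Like Selenium, return None when the attribute is missing.
        return self._attributes.get(name)


images = [
    FakeImage({"src": "data:image/jpeg;base64,AAAA"}),
    FakeImage({}),  # an image with no src attribute
]

# Explicit loop version.
sources_loop = []
for image in images:
    sources_loop.append(image.get_attribute("src"))

# Equivalent list comprehension.
sources = [image.get_attribute("src") for image in images]

print(sources == sources_loop)  # True
```

Notice that an element without a src attribute yields None, which is why such entries need to be skipped before decoding.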

After writing this line, we will have all the Base64 encoded strings inside the sources list.

Disclaimer: Please don’t put the sources list into a print statement. The Base64 strings are very long.

Now only one job remains: decoding these Base64 strings into images and saving them. For this purpose, I am creating another function called decodeB64StringToImage(data, filename) . Let’s import the base64 module and define the function.

from base64 import b64decode

def decodeB64StringToImage(data, filename):
    with open(filename, "wb") as fh:
        fh.write(b64decode(data))

This is the function that converts a Base64 string to an image. Now let’s pass our sources to this function one by one.

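import os

os.makedirs("./cats", exist_ok=True)  # make sure the output folder exists

for idx, src in enumerate(sources):
    # Skip images without a src or whose src is not a Base64 data URI
    if src is None or not src.startswith("data:"):
        continue
    parsedSrc = src[src.find(",") + 1:]
    decodeB64StringToImage(parsedSrc, "./cats/{0}.jpg".format(idx))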

You may ask, “Why are we doing src[src.find(",")+1:] ?” Because the src attribute we get from Google starts with a header.

...

In the code block above, you can see that data:image/jpeg;base64, just tells us that the data is encoded with Base64 and the original file type is JPEG. To convert this string to an image, we have to get rid of this header.
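Stripping the header can be sketched on its own, without a browser. The helper name split_data_uri and the sample bytes are assumptions for this example. The payload here is not a real JPEG, just some bytes to round-trip.

```python
from base64 import b64decode, b64encode


def split_data_uri(src):
    """Split 'data:image/jpeg;base64,<payload>' at the first comma."""
    header, _, payload = src.partition(",")
    return header, payload


# Build a fake src value the same way a data URI is laid out.
original_bytes = b"not really a jpeg"
src = "data:image/jpeg;base64," + b64encode(original_bytes).decode("ascii")

header, payload = split_data_uri(src)

print(header)                                # data:image/jpeg;base64
print(b64decode(payload) == original_bytes)  # True
```

Splitting at the first comma is safe because the Base64 alphabet itself does not contain commas.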

If we run our script now, we will find a bunch of cat memes inside the cats folder.

We’ve done it. We built our first Web Scraper using Selenium and Python. It searches for what we want and downloads photos. If you don’t like cats ( :( ), you can search for funny dog memes or even funny penguins.

Recap

Selenium automates a web browser and does what we want. We can find elements by class name, by type, or by XPath.

We’ve made a function that sends keystrokes with a random delay. And finally, we decoded Base64 strings into JPEG files and saved them.

There are so many things you can do with Selenium. If you want to check its documentation, you can click the link below.

Also, the source code for the previous and current articles is here:

Source code

Thanks for reading this far.

Studying Physics Engineering in Hacettepe University, Turkey. Programming since 7th grade.
