In my previous article, I explained what a scraper is and built a really basic one (which, strictly speaking, wasn't a scraper at all). In this article I'm going to share some small tips and tricks for scraping, and we'll continue building our "funny cat memes" scraper.
What is Web Scraping and how it works? — Part 1
General information about Web Scraping and Scrapers, plus a really basic scraper made with Python.
Are you a human?
When scraping a website, you should be careful, because a website can often detect whether it is being browsed by a real human or by a scraper.
There are some tricks to look more like a real human browsing. For example, in the previous article, when we used send_keys, Selenium typed the text instantly. A real user wouldn't do that: real users pause between key presses (roughly 150–180 milliseconds).
Also, the delay between characters shouldn't always be the same. Think about it: when you type, the distance your fingers travel between characters varies, so the delay between keystrokes varies too (at least it does for me). So we want to pick a random number between a minimum delay and a maximum delay.
To do this, we need another library called, you guessed it, random. Inside random there is a function called randrange(). It takes two arguments, a minimum and a maximum, and returns an integer that is at least the minimum and strictly less than the maximum.
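A quick standalone check of those bounds (not part of the scraper itself):

```python
from random import randrange

# randrange(150, 180) draws an integer from 150 up to, but not
# including, 180 -- the upper bound itself is never returned.
for _ in range(1000):
    value = randrange(150, 180)
    assert 150 <= value < 180
```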
You should keep this in mind when building a scraper. Otherwise the website could ban your IP address and stop sending you the results you expect.
Another important point: you should wait between the requests you send to the server. That isn't an issue for us right now, because we are using Selenium, and with Selenium we are not firing ten or more requests per second.
Anyway, let's add some delay between keystrokes. It's really simple in Python, because it has the time library, and inside it there is a function called sleep.
Let’s import these two libraries.
from time import sleep
from random import randrange
Now let's create a function that takes some arguments and types the keys with delays. I named this function send_keys_delayed.
def send_keys_delayed(element, text):
    for c in text:
        delay = randrange(150, 180) / 1000
        element.send_keys(c)
        sleep(delay)
This function types the text into the element with a delay between 150 and 180 milliseconds per character. You may ask, "Why do we divide our delay by 1000?". It's because the sleep function works in seconds, and our random value is in milliseconds.
Now let's use our new function. After finding search_box, call our new function to send the keys.
send_keys_delayed(search_box, "funny cat memes")
If you run the script again, you will see that there is a delay between each keystroke.
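If you want to see the pacing without launching a browser, you can point the same function at a stand-in object. FakeElement below is a toy class invented for this demo; only send_keys_delayed itself matches the scraper code:

```python
from random import randrange
from time import sleep, monotonic

def send_keys_delayed(element, text):
    for c in text:
        delay = randrange(150, 180) / 1000  # milliseconds -> seconds
        element.send_keys(c)
        sleep(delay)

class FakeElement:
    """Stand-in for a Selenium element; it just records what was typed."""
    def __init__(self):
        self.typed = ""

    def send_keys(self, keys):
        self.typed += keys

box = FakeElement()
start = monotonic()
send_keys_delayed(box, "cats")
elapsed = monotonic() - start

assert box.typed == "cats"
assert elapsed >= 0.5  # 4 characters x at least 150 ms each
```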
Our goal is saving funny cat memes, because we are cat lovers, right?
The first thing we need to do is change our URL to www.google.com/imghp, because we want to scrape images.
When you run the program again, you will see funny cat memes in Google Images.
Now let's find the sources of these images. If you open Developer Tools again and inspect an image, you will notice two things that are the same for every image. First, the src attribute of the image is Base64 encoded. Second, every image has a class called rg_i. So our plan is: get each image, read its Base64 string, and decode that string.
Selenium has many great methods for finding the element or elements you want. In this case we want multiple elements by class name, so the function we need is find_elements_by_class_name.
images = driver.find_elements_by_class_name("rg_i")
This line returns a list of elements that have the rg_i class, which is exactly what we want. Next, we need the src attribute of each of these images. In Selenium, every element has a method called get_attribute, which returns the value of the attribute you ask for.
In our case, we want the `src` attribute.
sources = [image.get_attribute("src") for image in images]
If you haven't seen this syntax before, it's called a list comprehension. It's a much cleaner way to build a list from a loop.
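To see what the comprehension replaces, here is the equivalent explicit loop. Plain strings and str.upper() stand in for the Selenium elements and get_attribute — a toy illustration, not scraper code:

```python
# Plain strings and str.upper() stand in for the web elements and
# get_attribute() from the scraper.
images = ["cat1.png", "cat2.png", "cat3.png"]

# The explicit loop...
sources_loop = []
for image in images:
    sources_loop.append(image.upper())

# ...and the equivalent one-line list comprehension:
sources_comp = [image.upper() for image in images]

assert sources_loop == sources_comp == ["CAT1.PNG", "CAT2.PNG", "CAT3.PNG"]
```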
After writing this line, we will have all the Base64-encoded strings inside sources.
Disclaimer: please don't put the sources array into a print statement; the Base64 strings are huge and will flood your terminal.
Only one task remains: decoding these Base64 strings into images and saving them. For this purpose, I am creating another function called decodeB64StringToImage(data, filename). Let's import the base64 module and define the function.
from base64 import b64decode

def decodeB64StringToImage(data, filename):
    with open(filename, "wb") as fh:
        fh.write(b64decode(data))
This is the function that converts a Base64 string into an image. Now let's feed our sources array into this function one item at a time.
for idx, src in enumerate(sources):
    if src is None:
        continue
    parsedSrc = src[src.find(",") + 1:]
    # hypothetical output path; make sure the "memes" folder exists first
    decodeB64StringToImage(parsedSrc, f"memes/{idx}.jpeg")
You may ask, "Why are we doing src[src.find(",")+1:]?". Because when we get the src attribute from Google, it comes with a header.
That header, data:image/jpeg;base64, , just tells the browser that what follows is Base64 encoded and that the original file type is JPEG. To convert the string back into an image, we have to get rid of this header first.
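Here is the stripping step in isolation, using a made-up payload instead of a real image from Google (the three leading bytes are the JPEG magic number, so the decode can be checked):

```python
from base64 import b64decode, b64encode

# A tiny fake "image" so the example is self-contained; real src values
# from Google carry a real JPEG payload after the same kind of header.
payload = b64encode(b"\xff\xd8\xff fake jpeg bytes").decode()
src = "data:image/jpeg;base64," + payload

# Strip everything up to and including the first comma,
# exactly as in the loop above.
parsedSrc = src[src.find(",") + 1:]

assert parsedSrc == payload
assert b64decode(parsedSrc).startswith(b"\xff\xd8\xff")  # JPEG magic bytes
```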
If we run our script now, you will see a folder, and inside that folder a bunch of cat memes.
We've done it: we built our first web scraper using Selenium and Python. It searches for what we want and downloads the photos. If you don't like cats ( :( ), you can search for funny dog memes, or even funny penguins.
Selenium automates a web browser and does what we want. We can find elements by class name, by tag, or by XPath.
We wrote a function that sends keystrokes with a random delay. And finally, we decoded Base64 strings into JPEG files and saved them.
There are so many things you can do with Selenium. If you want to check out its documentation, you can follow the link below.
Selenium with Python - Selenium Python Bindings 2 documentation
Note This is not an official documentation. If you would like to contribute to this documentation, you can fork this…
Also, the source code for the previous and current articles is here:
Thanks for reading this far.