What Is Web Scraping and How Does It Work? Part 1

In this series, I am going to talk about my experience with web scraping and do my best to explain what web scraping is and how to build a web scraper.

What Is Web Scraping?

Think about yourself. You browse the internet for hours (or maybe not). Maybe you save inspiring pictures on Pinterest, or you search for a question on Stack Overflow and copy a piece of code into your own. The possibilities are endless, but the key thing is that you are surfing the web and saving useful information somewhere.

If you are doing this, that means you are a scraper. Don’t worry, it’s not a bad thing to be (actually it is bad in some ways).

So, we have a new keyword: scraper. We also know what a scraper does. A scraper goes to a website, copies useful information, and saves it to a folder.

Web scraping is all about scrapers. For example, say you want to copy all the funny cat memes from Pinterest to a folder on your computer. You have two options:

The first option is to go to Pinterest in your web browser, search for “funny cat memes”, click every single image in view and copy it to your computer, scroll down, and repeat the last two steps probably 1000–2000 times.

It is really time-consuming and boring. But think about a computer. Computers do not run out of time and computers do not get bored. So the second option is to tell your computer to “go and find me funny cat memes on Pinterest”.

But how do we tell it that?

How to Scrape the Web?

Don’t worry, it is really easy to scrape the web, because so many people have worked together to build great open source tools.

In this series, we are going to use Python because it’s simple and easy to write. For the tooling, though, we have two options.

We can automate our web browser, or we can make the website believe we are using a web browser. I will cover both options in depth in another article.
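To give a feel for the second option before the later article covers it, here is a minimal sketch that uses only Python’s standard library. It assumes that the second option means sending a plain HTTP request with a browser-like User-Agent header. That reading, the BROWSER_USER_AGENT value, and the helper names build_request and fetch_html are assumptions made for illustration, not necessarily the approach the series will take.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like identification string; real browsers send longer ones.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url: str) -> Request:
    """Prepare an HTTP request that identifies itself like a web browser."""
    return Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text (performs a network call)."""
    with urlopen(build_request(url), timeout=10) as response:
        # Fall back to UTF-8 when the server does not announce a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://example.com") would download that page’s HTML without opening any browser window. Many sites need more than one header to be scraped this way, so treat this as a starting point only.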

Let’s build our first scraper

I’m guessing you have Python installed. If you don’t, you can easily install it from the official Python website.

Python’s official website

For this little project, we are going to use the Selenium bindings for Python.

Selenium automates a web browser. For example, if you tell it to go to “google.com”, it will type google.com in the address bar and go to that website. But Selenium needs a way to control the browser, and for Chrome that is a separate program called ChromeDriver. You may ask, “I already have a web browser, why do I need to download something else?”

You are correct, but think about it. You’ve built a scraper using Chrome and want to sell it to someone else. Asking them to find and set up the matching driver is not what we want. So we put the driver somewhere in our code folder and ship our scraper with that folder. Note that ChromeDriver only controls Chrome; the Chrome browser itself still has to be installed on the machine that runs the scraper.

We have two choices for the browser: Chrome or Firefox. In this series, I am going to use Chrome. Now let’s go to ChromeDriver’s website and download the driver.

Once you open the website, you have two options: the beta release or the stable release. I am going to install the stable release, so I clicked Latest Stable Release.

Chrome Driver’s website

Once you click your release version, another page opens, and from there you need to download the file that matches your operating system.

Chrome Driver versions for different operating systems.

After you unzip the file, you should see an executable file. This file needs to go inside your code folder.

Installing Selenium is really easy. Just open your command prompt (or terminal) and type:

pip install selenium

This will install Selenium on your computer. But we’re not done yet.

Now open your favorite text editor (mine is Visual Studio Code) and create a new file called main.py in the same directory as your driver.

At this point, your folder should look like this.

Folder structure required for this project.

I am using Visual Studio Code, so my full editor looks like this right now.

Visual Studio Code editor screen (zoomed in)

Now, let’s write some code.

The first thing we need to do is initialize Selenium.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome("./chromedriver.exe")

This code block imports webdriver from the selenium library and creates a new Chrome driver with ./chromedriver.exe. The second line imports Keys; you will see why we need it soon.
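If you installed Selenium recently, the snippet above may not run as written: as far as I know, the 4.x releases expect the driver location to be wrapped in a Service object instead of being passed directly to webdriver.Chrome. The sketch below shows one way to do that while still keeping the driver in the project folder. The helper names local_driver_path and make_driver are made up for this example, and it assumes the Windows download is called chromedriver.exe and the macOS or Linux one is called chromedriver.

```python
import platform
from pathlib import Path


def local_driver_path(folder: Path, system: str = "") -> Path:
    """Return where ChromeDriver is expected inside the project folder."""
    system = system or platform.system()
    # Assumption: the Windows file is chromedriver.exe, other systems use no extension.
    filename = "chromedriver.exe" if system == "Windows" else "chromedriver"
    return folder / filename


def make_driver(folder: Path):
    """Start Chrome through the driver stored in the given folder (Selenium 4 style)."""
    # Imported here so the path helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service(executable_path=str(local_driver_path(folder)))
    return webdriver.Chrome(service=service)
```

In main.py you could then write driver = make_driver(Path(__file__).parent) in place of the webdriver.Chrome line.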

If you run this Python script, it will open an empty Chrome window and then exit.

Our goal is to automate this empty Chrome window to do stuff for us.

For simplicity, we will go to Google.com and search for “funny cat memes”. To do this, we need to tell our driver to get the Google homepage, find the search box element, send keys to the search box, and press Enter while the focus is in the search box.

It looks like a lot, but it’s not. Here is the code I wrote.

driver.get("https://www.google.com")
search_box = driver.find_element_by_name("q")
search_box.send_keys("funny cat memes")
search_box.send_keys(Keys.RETURN)

The first line in this code block tells the browser to go to that website.

search_box = driver.find_element_by_name("q") finds the element whose name attribute is “q”. How do we know that the search box’s name attribute is “q”? If you open Google in your web browser and press F12, or right-click the search box and click Inspect Element, the developer tools will show you the HTML of the website.

Chrome DevTools highlights search box’s HTML

If you look closely at the <input> element, you will see the name="q" attribute. This is what we are trying to find in the code.
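The same check can be done in code. This small sketch uses Python’s built-in html.parser to list the name attribute of every <input> element in a piece of HTML. SAMPLE_HTML is a simplified, made-up stand-in for what DevTools shows, not Google’s real markup, and InputNameCollector is a name chosen for this example.

```python
from html.parser import HTMLParser

# Simplified, made-up markup standing in for what DevTools shows.
SAMPLE_HTML = (
    '<form action="/search">'
    '<input class="box" name="q" type="text" title="Search">'
    "</form>"
)


class InputNameCollector(HTMLParser):
    """Collect the name attribute of every <input> element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs for the current tag.
        if tag == "input":
            name = dict(attrs).get("name")
            if name is not None:
                self.names.append(name)


collector = InputNameCollector()
collector.feed(SAMPLE_HTML)
print(collector.names)  # ['q']
```

The value it prints is the one we pass to Selenium to locate the search box.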

search_box.send_keys("funny cat memes") types “funny cat memes” into the search box.

search_box.send_keys(Keys.RETURN) simulates pressing the Enter key on our keyboard.
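One caveat: as far as I know, newer Selenium 4 releases dropped the find_element_by_name helper in favor of find_element with a By locator. Under that assumption, the four steps above could be sketched like this. search_google is a hypothetical helper name, and driver is the Chrome driver created earlier.

```python
def search_google(driver, query):
    """Open Google, type the query into the search box, and press Enter."""
    # Imported inside the function so this file loads even before Selenium is installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver.get("https://www.google.com")
    # Locator-style spelling of find_element_by_name("q").
    search_box = driver.find_element(By.NAME, "q")
    search_box.send_keys(query)
    search_box.send_keys(Keys.RETURN)
    return search_box
```

With this in place, search_google(driver, "funny cat memes") performs the same search as the original four lines.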

Now when we run this code, a Chrome window will open and follow our instructions.

We’ve done it. But where are the cat memes? We haven’t saved any cat memes yet. For the sake of simplicity, let’s take a screenshot of what we see.

driver.get_screenshot_as_file("screenshot.png")

When we add this command at the end of our main.py script, it will take a screenshot of the webpage and save it as screenshot.png.

Screenshot taken by Selenium

This screenshot is what our scraper saves in the folder.
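If you would rather keep screenshots in their own folder, a small wrapper around the same call might look like this. save_screenshot is a hypothetical helper, not part of Selenium; it relies on get_screenshot_as_file returning False when the file cannot be written, and the folder name you pass in is up to you.

```python
from pathlib import Path


def save_screenshot(driver, folder, name="screenshot.png"):
    """Save the current page as a PNG inside folder and return the file path."""
    target = Path(folder) / name
    # Make sure the output folder exists before writing into it.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Treat a False return value as a failed write.
    if not driver.get_screenshot_as_file(str(target)):
        raise OSError(f"could not write screenshot to {target}")
    return target
```

For example, save_screenshot(driver, "screenshots") would write screenshots/screenshot.png relative to the folder you run the script from.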

As you can see, we’ve built a really basic scraper. Actually, it is not very useful yet because we’re just capturing a screenshot of the search results.

All the code I’ve written in this article is here.

If you’ve read this far, thank you so much.

Part 2 of this series is here:

See you in the next articles.

Until then, cheers.

Studying Physics Engineering at Hacettepe University, Turkey. Programming since 7th grade.
