What Is Web Scraping and How Does It Work? (Part 1)
In this series, I am going to share my experience with web scraping and do my best to explain what web scraping is and how to build a web scraper.
What is Web Scraping
Think about yourself. You browse the internet for hours (or maybe not). Maybe you save inspiring pictures on Pinterest, or you search for a question on Stack Overflow and copy a piece of code into your own project. The possibilities are endless, but the key point is that you are surfing the web and saving useful information somewhere.
If you do this, you are a scraper. Don’t worry, it’s not a bad thing to be (although it can be in some cases, for example when a site’s terms of use forbid it).
So we have a new keyword: scraper. We also know what a scraper does. It goes to a website, copies useful information, and saves it to a folder.
Web scraping is all about scrapers. For example, say you want to copy all the funny cat memes from Pinterest to a folder on your computer. You have two options:
The first option is to open Pinterest in your web browser, search for “funny cat memes”, click every single image in the view, copy it to your computer, scroll down, and repeat the last two steps probably 1,000 to 2,000 times.
That is really time-consuming and boring. But think about a computer. Computers are not short on time, and they do not get bored. So the second option is to tell your computer to “go and find me funny cat memes on Pinterest”.
But how do we tell it that?
How to Scrape the Web?
Don’t worry, scraping the web is really easy, because many people have worked together to build great open-source tools.
In this series, we are going to use Python because it’s simple and easy to write. For the tooling, we have two options.
We can automate a real web browser, or we can make the website believe that we are using a web browser. I will cover both options in depth in another article.
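To make the second option a little more concrete, here is a minimal sketch that uses only Python’s standard library. It builds an HTTP request carrying a browser-like User-Agent header, which is one common way for a script to present itself as a browser. The helper names build_browser_request and fetch_html, the header value, and any URL you pass in are illustrative assumptions, not the tools this series settles on.

```python
from urllib.request import Request, urlopen

# Illustrative User-Agent value (an assumption); real browsers send longer, versioned strings.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that identifies itself with a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page as text. Calling this performs a real network request."""
    with urlopen(build_browser_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html with a page address returns the raw HTML as a string. No browser window is involved, which is the main difference from the browser-automation approach used in the rest of this article.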
Let’s build our first scraper
I’m assuming you have Python installed. If you don’t, you can easily install it from the official Python website.
For this little project, we are going to use the Selenium bindings for Python.
Selenium automates a web browser. For example, if you tell it to go to “google.com”, it will open that address in the browser for you. To do this, Selenium needs a driver program for the browser, which we have to download separately. You may ask, “I already have a web browser, why do I need to install anything else?”
You are correct, but think about it. Suppose you’ve built a scraper using Chrome and want to sell it to someone else. Asking them to track down and install the right driver is not what we want. So we keep the driver in our code folder and ship our scraper with that folder. That way, the customer doesn’t need to install additional software beyond Chrome itself.
We have two choices for the browser: Chrome or Firefox. In this series, I am going to use Chrome, so let’s go to the ChromeDriver website and download the driver. Note that ChromeDriver is not the browser itself; it is the executable Selenium uses to control an installed copy of Chrome.
Once you open the website, you have two options: the beta release or the stable release. I am going to install the stable release, so I clicked Latest Stable Release.
Once you click your release version, another page opens. From there, download the file that matches your operating system.
After you unzip the file, you should see an executable file. This file needs to go inside your code folder.
Installing Selenium is really easy. Just open your command prompt (or terminal) and type:
pip install selenium
This will install Selenium on your computer. But we’re not done yet.
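If you want to confirm that the installation worked, a small check like the following can help. It is only a sketch, and installed_version is a hypothetical helper name. Knowing the version is useful because the Selenium API has changed between major releases.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("selenium")
if version is None:
    print("Selenium is not installed yet; run: pip install selenium")
else:
    print(f"Selenium {version} is installed")
```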
Now open your favorite text editor (mine is Visual Studio Code) and create a new file called main.py in the same directory as your driver.
At this point, your folder should contain two files: the driver executable (chromedriver.exe) and main.py.
Now, let’s write some code.
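The first thing we need to do is initialize Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(service=Service("./chromedriver.exe"))
This code block imports webdriver from the selenium library and creates a new Chrome driver that uses ./chromedriver.exe. In Selenium 4, the location of the driver executable is passed through a Service object, which is why the second line imports Service. The third line imports Keys; you will see why we need it shortly.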
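The path ./chromedriver.exe only fits Windows; on macOS and Linux the unzipped file is simply called chromedriver. Here is a minimal sketch, assuming Selenium 4 and a driver stored next to main.py, that picks the file name for the current operating system before starting Chrome. The helper names chromedriver_path and create_driver are hypothetical.

```python
import platform
from pathlib import Path


def chromedriver_path(folder=".", system=None):
    """Return the absolute path where ChromeDriver is expected inside `folder`."""
    system = system or platform.system()
    # The download is named chromedriver.exe on Windows and chromedriver elsewhere.
    filename = "chromedriver.exe" if system == "Windows" else "chromedriver"
    return Path(folder).resolve() / filename


def create_driver(folder="."):
    """Start Chrome using the driver binary stored in `folder`."""
    # Imported lazily so chromedriver_path() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = chromedriver_path(folder)
    if not path.is_file():
        raise FileNotFoundError(f"ChromeDriver not found at {path}")
    # Selenium 4 receives the driver location through a Service object.
    return webdriver.Chrome(service=Service(executable_path=str(path)))
```

With this in place, driver = create_driver() could stand in for the hard-coded path, and a missing driver file produces a clear error message instead of a less obvious Selenium failure.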
If you run this Python script, it will open an empty Chrome window and then exit.
Our goal is to automate this empty Chrome window so that it does things for us.
For simplicity, we will go to Google.com and search for “funny cat memes”. To do this, we need to tell our driver to get the Google homepage, find the search box element, send keys to the search box, and press Enter while the focus is in the search box.
It sounds like a lot, but it’s not. Here is the code I wrote.
from selenium.webdriver.common.by import By

driver.get("https://www.google.com")
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("funny cat memes")
search_box.send_keys(Keys.RETURN)
The import brings in By, which Selenium 4 uses to describe how an element should be located (older releases offered find_element_by_name for the same job). It can sit with the other imports at the top of main.py. The driver.get line tells the browser to go to that website.
search_box = driver.find_element(By.NAME, "q") finds the element whose name attribute is “q”. How do we know that the search box’s name attribute is “q”? If you open Google in your web browser and press F12, or right-click the search box and click Inspect Element, the developer tools will show you the HTML of the page.
If you look closely at the <input> element, you will see the name="q" attribute. This is what we are trying to find in the code.
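To see what “finding an element by its name attribute” means without opening a browser, here is a small sketch using Python’s built-in html.parser on a tiny, made-up piece of HTML. The snippet in SAMPLE_HTML is a simplified stand-in, not Google’s real markup (which is much larger and can change), and NameFinder is a hypothetical helper class.

```python
from html.parser import HTMLParser


class NameFinder(HTMLParser):
    """Collect every tag whose name attribute equals `wanted_name`."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("name") == self.wanted_name:
            self.matches.append((tag, attributes))


# Simplified, invented search form used only for this illustration.
SAMPLE_HTML = (
    '<form action="/search">'
    '<input name="q" type="text" title="Search">'
    '<input name="btnK" type="submit" value="Search">'
    "</form>"
)

finder = NameFinder("q")
finder.feed(SAMPLE_HTML)
print(finder.matches)
```

The parser reports one match, the text input, and skips the submit button because its name is different. Selenium does the same kind of lookup, except that it searches the live page inside the browser.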
search_box.send_keys("funny cat memes") types “funny cat memes” into the search box.
search_box.send_keys(Keys.RETURN) simulates pressing the Enter key on our keyboard.
Now when we run this code, a Chrome window will open and carry out our instructions.
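One practical detail: after Enter is pressed, the results page needs a moment to load, and the script does not wait for it by default. Selenium ships its own waiting helper (WebDriverWait), but the idea is simple enough to sketch by hand. The wait_until function below is a hypothetical stand-in, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In main.py, a call such as wait_until(lambda: "funny cat memes" in driver.title) placed after the Enter key press would pause until the page title mentions the query. This assumes the results page puts the search text in its title, so adjust the condition if it does not.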
We’ve done it. But where are the cat memes? We haven’t saved any yet. For the sake of simplicity, let’s take a screenshot of what we see.
driver.get_screenshot_as_file("screenshot.png")
When we add this line at the end of our main.py script, it will take a screenshot of the web page and save it as screenshot.png.
This screenshot is what our scraper saves in the folder.
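If you would like the script to tell you when the screenshot could not be written, a small wrapper can help. This is a sketch only: save_screenshot is a hypothetical helper, and it relies on get_screenshot_as_file returning False when saving fails.

```python
from pathlib import Path


def save_screenshot(driver, folder=".", filename="screenshot.png"):
    """Save a screenshot of the current page and return the path it was written to."""
    target = Path(folder).resolve() / filename
    # Create the destination folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # get_screenshot_as_file returns False when the file could not be written.
    if not driver.get_screenshot_as_file(str(target)):
        raise OSError(f"could not save screenshot to {target}")
    return target
```

Calling save_screenshot(driver) at the end of main.py, followed by driver.quit(), saves the image and then closes the Chrome window so that it does not stay open after the script ends.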
As you can see, we’ve built a really basic scraper. It is not very useful yet, because we are only capturing a screenshot of the search results.
All the code I’ve written in this article is available here.
If you’ve read this far, thank you so much.
Part 2 of this series is here:
See you in the next articles.
Until then, cheers.