How to scrape captcha protected websites with Python, BeautifulSoup and MongoDB — Chapter 1

Michael Gerstenberg
10 min read · Feb 11, 2021
Photo by Florian Olivo on Unsplash

My wife is learning German. When she stumbles upon one of those complicated German verb conjugations, it’s a struggle to find the right translation. She loses plenty of time switching between different translators and copy-pasting terms, or she asks me, only to realise that I have no clue about German grammar. Why not eliminate those unnecessary steps?

So my goal is to create an application where you can input any conjugated German verb, no matter the tense or mood, and there are plenty of those in the German language. The output should be the basic form of the verb and a helpful translation. In my case to Thai.


Now I will tell you the story of accomplishing this goal. It is divided into several parts.

  • Chapter 1: Find an appropriate data source
  • Chapter 2: Access the data source
  • Chapter 3: Extract the information
  • Chapter 4: Building the application

This is not a programming course. However, I will explain everything I do step by step. That includes a bit of Python, web development, data structures and MongoDB, so beginners should be able to follow along as well.

Find a proper source

The source must have the content we require. That means relevant coverage of all German verbs and their conjugations. As a plus we would like to have definitions, examples and synonyms.

A Google search doesn’t return a comprehensive data set.
My wife already uses a page called verbformen.de. On this site all German verbs, including their conjugations, can be found. She also uses translate.google.com for translating the basic verb into Thai, but in her experience the results are often unsatisfying. That comes from the fact that Google translates through English (e.g. it mixes up the word plant, which can mean either a production plant or a vegetable plant). So a further goal is to implement additional translation sources with context. We found a small data set from Longdo which we want to implement as well.

The content needs to be under an appropriate license because we want to use the data without legal problems. Let’s investigate that further.

Copyright / License
It is mandatory that the license of our data source fits our usage. Content under copyright shouldn’t be used at all. Licensed content can often be used for personal and/or commercial purposes. If it’s not completely clear, we had better assume that the data is protected by copyright. In our case we are lucky: the website owner puts all the data under the CC BY-SA 3.0 license, which makes it freely available for personal and even commercial use.

Are bots welcome?
We can find the answer with a quick look inside robots.txt, which contains the website’s rules for crawlers. The robots.txt can be accessed by appending /robots.txt to any domain. This is an extract of verbformen.de/robots.txt:

User-agent: *
Disallow: *.pdf
Disallow: *.docx
Disallow: *.wav
Disallow: *.mp3
User-agent: msnbot
Crawl-delay: 2
Disallow: *.pdf
Disallow: *.docx
Disallow: *.wav
Disallow: *.mp3
User-agent: SemrushBot
Disallow: /

Now we have to understand what is allowed and what isn’t. In this case crawling of .pdf, .docx, .wav and .mp3 files is disallowed for every user agent (*). The msnbot has to wait two seconds (crawl-delay) between requests, and SemrushBot is not allowed to crawl at all.
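If we want to check such rules from code, Python’s standard library ships a robots.txt parser. Here is a small sketch (the example URL is made up for illustration, and note that this parser follows the original robots.txt spec, so wildcard rules like *.pdf may still need a manual check):

from urllib import robotparser

# load and parse the live robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('https://www.verbformen.de/robots.txt')
rp.read()

# ask whether a generic crawler ('*') may fetch a given (illustrative) URL
print(rp.can_fetch('*', 'https://www.verbformen.de/konjugation/arbeiten.htm'))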

Investigate our data source

In an ideal world we could just download a complete data set (e.g. Wikipedia offers dumps). The second best option would be an API, like Google offers for its translation services. In the case of verbformen.de our only option is to extract the desired data from the HTML content. There are two approaches to find all relevant pages.

The unstructured approach (like Google does it)
In most cases a website doesn’t have an index, so we don’t have a comprehensive list of all relevant pages. To tackle that obstacle you could start at some page, extract all links and follow them to dig deeper and deeper into the website until no more new pages are found.

The structured approach
In our case we are lucky. Verbformen.de offers an index on a separate website, verblisten.de. This page is basically a complete list of all verbs available on verbformen.de.

A common practice is to save all requested pages locally before scraping them. In case something goes wrong there is no need to send requests over and over again, and it also protects us if the website structure changes at some point in the future.
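A minimal sketch of this caching idea, with a hypothetical helper and file layout that are not part of the original project:

import os
import requests

def download_page(url, folder='raw_html'):
    """Fetch a page once and cache the HTML on disk so we never re-request it."""
    os.makedirs(folder, exist_ok=True)
    # derive a simple file name from the URL (illustrative, not collision-proof)
    filename = os.path.join(folder, url.rstrip('/').split('/')[-1] or 'index.html')
    if os.path.exists(filename):
        with open(filename, encoding='utf-8') as f:
            return f.read()
    response = requests.get(url)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text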

Start coding

Like most software developers I’m lazy by nature. So before we get started we figure out the most convenient toolset for our use case. And because doing things together is more fun, I picked my coding friend Robin Schnider to help me with our little language project.

Everybody who has worked with Python knows that it offers great string and data manipulation out of the box. Additionally there are very helpful libraries which make fast prototyping incredibly easy. In my opinion Python is a great fit for scraping data. We will work with my favourite code editor, Visual Studio Code, and the standard macOS zsh-based terminal.

We start by creating a new repository on GitHub and cloning it to our machine. Then we set up a fresh Python 3 environment, so our project is isolated and has its own dependencies.

$ git clone http://url/to/our/git
$ cd path/to/our/project
$ python3 -m venv .ENV
$ source .ENV/bin/activate

Scraping the index

Let’s create a new file named 01_get_data_sources_from_verblisten.py. This code will only be used to gather data once, so a couple of functions without classes will do.

To be able to request a website with Python we need to install the requests library. For pulling data out of the requested HTML we use the BeautifulSoup library. To install both libraries we make use of the Python package manager pip.

(.ENV)$ pip install requests beautifulsoup4
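Optionally we can pin the installed packages so the environment can be recreated on another machine later; this is standard pip, not something the project strictly needs:

(.ENV)$ pip freeze > requirements.txt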

In our Python file we import the just-installed libraries, make our first request and create a BeautifulSoup object from the response content. We can then access the entire HTML content via the soup object.

import requests
from bs4 import BeautifulSoup

# request the first index page (letter a, page 1) and parse the HTML
verb_index_url = 'https://www.verblisten.de/listen/verben/anfangsbuchstabe/vollstaendig-1.html?i=^a'
response = requests.get(verb_index_url)
soup = BeautifulSoup(response.content, 'html.parser')
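Before digging into the HTML it can’t hurt to verify that the request actually succeeded. A quick optional check using the variables from the snippet above:

# raises an exception if the request returned an error status code
response.raise_for_status()

# print the page title as a simple smoke test
print(soup.title.text if soup.title else 'no <title> found')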

Beautiful Soup basics
For the next steps we need to know some basic functionality of BeautifulSoup. Let’s have a short look at the following HTML example, which we assume has been parsed into the content variable, and at some standard use cases.

<html>
  <div class="class_1">
    <a href="https://url1.com">Link 1</a>
    <a href="https://url2.com">Link 2</a>
    <a href="https://url3.com">Link 3</a>
    <div class="class_2">
      <a href="https://url1.com">Link 1</a>
      <a href="https://url2.com">Link 2</a>
      <a href="https://url3.com">Link 3</a>
    </div>
  </div>
  <section id="section_1">
    <a href="https://url1.com">Link 1</a>
    <a href="https://url2.com">Link 2</a>
    <a href="https://url3.com">Link 3</a>
  </section>
</html>

links = content.find_all('a') returns a list with all nine links of the entire document.

divs_with_class_2 = content.find_all('div', class_='class_2') returns a list of all divs with the class class_2. That also works with the attribute id like id='section_1' .

When we call find_all('a') on divs_with_class_2[0] we only receive the three links inside the first div that has the class class_2:
links_in_div = divs_with_class_2[0].find_all('a').

If we want the text of a specific item i we use links[i].text (its child nodes are available via links[i].contents). To get any attribute like href we use links[i].get('href').

This is actually all we need. For more details please have a look into the documentation.
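To make these calls concrete, here is a small self-contained example using a trimmed-down version of the HTML above:

from bs4 import BeautifulSoup

sample_html = """
<html>
  <div class="class_1">
    <a href="https://url1.com">Link 1</a>
    <div class="class_2">
      <a href="https://url2.com">Link 2</a>
    </div>
  </div>
</html>
"""

content = BeautifulSoup(sample_html, 'html.parser')

links = content.find_all('a')                                  # every <a> in the document
divs_with_class_2 = content.find_all('div', class_='class_2')  # only divs with class="class_2"
links_in_div = divs_with_class_2[0].find_all('a')              # <a> tags inside that div

print(len(links))            # 2
print(links[0].text)         # Link 1
print(links[0].get('href'))  # https://url1.com
print(len(links_in_div))     # 1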

Collecting the urls
The goal is to have a complete list data_sources with all verbs and their URLs. When we analyse the URL of the index pages, we see a counter and a letter. That means we should be able to iterate through all letters and, for every single letter, increase the counter until no more content is found.

data_sources = []
base_url = 'https://www.verblisten.de/listen/verben/anfangsbuchstabe/vollstaendig'
# the index pages follow the pattern <base_url>-<counter>.html?i=^<letter>
verb_index_url = f'{base_url}-{counter}.html?i=^{letter}'

With Python it’s easy to iterate through a string, so we provide a string with all German letters.

alphabet = (string.ascii_lowercase[:26]+'äöü').replace('y','')

At first we implemented an overcomplicated method to iterate through the German alphabet before we realised how simple life can be. By the way, no German verbs start with y. To iterate through all index pages we use a for loop over all letters.

alphabet = 'abcdefghijklmnopqrstuvwxzäöü'
for letter in alphabet:
    verb_index_url = f'{base_url}-{counter}.html?i=^{letter}'
    response = requests.get(verb_index_url)
    soup = BeautifulSoup(response.content, 'html.parser')

We increase the counter in the URL as long as the div with the class listen-spalte contains links. As soon as we don’t receive links anymore we continue with the next letter. For this iteration we use a while loop.

counter = 0
links_found = True
while links_found:
    counter += 1
    verb_index_url = f'{base_url}-{counter}.html?i=^{letter}'
    response = requests.get(verb_index_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    for div in soup.find_all('div', class_='listen-spalte'):
        links = div.find_all('a')
        for a in links:
            pass  # creating an object for each link (shown below)
        if len(links) < 1:
            links_found = False

We loop through all links creating a dictionary containing relevant information.

data_sources.append({
    'word': a.get('title').replace('Konjugation ', ''),
    'conjugations': {
        'download_status': False,
        'url': a.get('href')
    },
    'scrape_status': False
})

We save the word, the URL, and the download and scrape status. To document whether a page has already been downloaded we use the boolean download_status; to document whether a word has already been scraped we use the boolean scrape_status. This enables us to stop and continue the time-consuming and error-prone processes of downloading and scraping without losing progress.

While observing the website structure we realise that we can get definitions and example sentences just by modifying the URL a bit. That might be interesting in the future, so we add these additional data sources to our dictionary.

data_sources.append({
    'word': a.get('title').replace('Konjugation ', ''),
    'conjugations': {
        'download_status': False,
        'url': a.get('href')
    },
    'definitions': {
        'download_status': False,
        'url': a.get('href').replace('verbformen.de/konjugation', 'woerter.net/verbs')
    },
    'examples': {
        'download_status': False,
        'url': a.get('href').replace('.de/konjugation', '.de/konjugation/beispiele')
    },
    'scrape_status': False
})

Connecting to the database
The data_sources list contains all the relevant data. To persist this list without creating a bunch of tables and relations we use a NoSQL database. One more benefit for us is that we don’t have to define a schema for now. Because we are working as a team, we both need easy access to the database, so dumping our data into the MongoDB cloud is a good fit. And for our use case it’s free of charge.

We simply register at MongoDB Atlas and create a cluster and a database. There are many tutorials on how to set this up, so we will not explain it here.

From the MongoDB Atlas portal we can copy a connection string including our credentials. To keep the credentials secret we put them into a separate file, config.py.

mongo_db_secret = 'mongodb+srv://<user>:<password>@verbcluster.qwvfi.mongodb.net/dict?retryWrites=true&w=majority'

While developing, our high-class German internet connection had hiccups. As usual I restarted the router. What I didn’t have in mind: a new IP address was assigned after restarting. Some time later we continued debugging our script but ran into connection issues related to our MongoDB cluster. When you create a MongoDB cluster, it puts your current IP on the whitelist, so the new IP was not whitelisted. This took some time to figure out… (Germans love their internet… NOT)

Let’s not forget to keep our credentials secret and add config.py to .gitignore before we commit. We also add .DS_Store, .ENV/ and .vscode/ because they only contain temporary data from our machines which is not relevant for the project.

.DS_Store       # temp data from osx
.ENV/ # our python environment
.vscode/ # temp data from Visual Studio
config.py # secrets, passwords, keys nobody should know

To use our MongoDB Atlas cloud database we install the pymongo library.

(.ENV)$ pip install pymongo

We create a new file mongo_db.py for our database connection so we can import that wherever needed. We import the pymongo library and the config.py configuration file.

As we don’t need schema validation yet, we let MongoDB create the database automatically. With the following function we connect to MongoDB and return the handle to a database named dict.

import config
from pymongo import MongoClient

def connect_mongo_db():
    client = MongoClient(config.mongo_db_secret)
    return client.dict

Let’s extend our file 01_get_data_sources_from_verblisten.py by importing from mongo_db.py and calling the connect function.

from mongo_db import connect_mongo_db

db = connect_mongo_db()

With just one simple command we can create a collection named data_sources and push our complete data_sources list into it.

db.data_sources.insert_many(data_sources)
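Later, when we actually download and scrape the pages, these status flags let us resume after an interruption. A rough sketch of how that could look with standard pymongo queries (the exact workflow comes in the next chapters):

# fetch all entries whose conjugation page has not been downloaded yet
pending = db.data_sources.find({'conjugations.download_status': False})

for entry in pending:
    # ... download entry['conjugations']['url'] here ...
    # then mark the entry as done so a restart skips it
    db.data_sources.update_one(
        {'_id': entry['_id']},
        {'$set': {'conjugations.download_status': True}}
    )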

Finally we put all our snippets together into one function named get_data_sources() and make our module executable as a main program.

import requests
from bs4 import BeautifulSoup
from mongo_db import connect_mongo_db

db = connect_mongo_db()

def get_data_sources():
    base_url = 'https://www.verblisten.de/listen/verben/anfangsbuchstabe/vollstaendig'
    alphabet = 'abcdefghijklmnopqrstuvwxzäöü'
    data_sources = []
    for letter in alphabet:
        counter = 0
        links_found = True
        while links_found:
            counter += 1
            verb_index_url = f'{base_url}-{counter}.html?i=^{letter}'
            response = requests.get(verb_index_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # safety guard: stop for this letter if the page has no listing columns at all
            columns = soup.find_all('div', class_='listen-spalte')
            if not columns:
                links_found = False
            for div in columns:
                links = div.find_all('a')
                for a in links:
                    data_sources.append({
                        'word': a.get('title').replace('Konjugation ', ''),
                        'conjugations': {
                            'download_status': False,
                            'url': a.get('href')
                        },
                        'definitions': {
                            'download_status': False,
                            'url': a.get('href').replace('verbformen.de/konjugation', 'woerter.net/verbs')
                        },
                        'examples': {
                            'download_status': False,
                            'url': a.get('href').replace('.de/konjugation', '.de/konjugation/beispiele')
                        },
                        'scrape_status': False
                    })
                if len(links) < 1:
                    links_found = False
    db.data_sources.insert_many(data_sources)

if __name__ == "__main__":
    get_data_sources()

If you are wondering what if __name__ == "__main__": does: it distinguishes between running the file directly and importing it as a module, so get_data_sources() only runs when we execute the script ourselves.
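A tiny illustration with a hypothetical module:

# demo.py (hypothetical module, just to illustrate the idiom)
def main():
    print('running as a script')

if __name__ == "__main__":
    # True only when the file is executed directly (python demo.py),
    # False when another module does `import demo`
    main()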

To execute our module we use the following command in the terminal. The execution takes a couple of minutes.

(.ENV)$ python 01_get_data_sources_from_verblisten.py

When we look into our MongoDB collection after the script finishes, we will see all the verbs including their URLs.

document inside MongoDB Atlas
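A single document in the collection should look roughly like this (the verb and the exact URLs are illustrative; the field layout follows the dictionary we build above):

{
  "_id": ObjectId("..."),
  "word": "arbeiten",
  "conjugations": {
    "download_status": false,
    "url": "https://www.verbformen.de/konjugation/arbeiten.htm"
  },
  "definitions": {
    "download_status": false,
    "url": "https://www.woerter.net/verbs/arbeiten.htm"
  },
  "examples": {
    "download_status": false,
    "url": "https://www.verbformen.de/konjugation/beispiele/arbeiten.htm"
  },
  "scrape_status": false
}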

We got a collection with all German verbs and links to gather the data we need. Awesome!
In the next part we will download all pages and overcome a captcha obstacle.

You can find all source code here:
https://github.com/michael-gerstenberg/GermanVerbScraper

Hope you enjoyed the first chapter. You can expect the next chapter to be released within one week. Please leave a comment if you have questions or know what we could do/explain better.

Sincerely
Michael & Robin
