NLP for learners: Scraping article URLs from websites in Python

In this series of articles, I’ll introduce you to the process of natural language processing using deep learning. I am still a beginner in Python myself. This article is aimed at people who have worked with a few other languages but know little or nothing about Python.

You might think deep learning is difficult, but writing and running the code is easy enough. The goal here is not to understand the intricacies of deep learning, but rather to use it in a practical way and get results.

As an example, we are going to create an AI that generates English sentences. For the training, we will use articles from the American VOA Learning English site.

You need to check the copyright laws of your country before you copy the article to your computer’s hard drive.

Web scraping

Scraping is the process of extracting the necessary data from a website. The scraping process varies from website to website. Here, I’m going to extract the articles from VOA Learning English.

From the top page, go to the “As it is” category page.

We can find several articles. Let’s try to get the URLs of each article from here.

import numpy as np
import requests
from bs4 import BeautifulSoup
import io
import re

These lines import the libraries we will use.

numpy is a library for efficient numerical computation. Because we import it as np, it appears as np in the lines that follow.

requests is used to download the HTML of a website.

BeautifulSoup is used to parse the HTML and extract only the parts we need.

These three libraries have to be installed beforehand (installation instructions are omitted here).

io is used for file input and output, and re is used for regular expressions. Both come with Python’s standard library, so they do not need to be installed.

url = 'https://learningenglish.voanews.com/z/3521'
res = requests.get(url)

Here we use the requests library.

The string url holds the address from which the HTML is to be retrieved, and requests.get() retrieves it. res is not a string but an object. You can imagine an object as a box containing a variety of information. After the line above runs, that information, including the HTML string, is placed into res.
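For example, you can peek inside res to see some of that information (the exact output depends on the site):

print(res.status_code)  # 200 means the request succeeded
print(type(res))        # <class 'requests.models.Response'>
print(res.text[:100])   # the first 100 characters of the HTML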

soup = BeautifulSoup(res.text, 'html.parser')

Here we use the BeautifulSoup library.

res.text refers to the piece of data labeled text inside the res object. Imagine writing the HTML on a piece of paper and attaching a sticky note that says text to it. res.text contains an HTML string like the following.

<!DOCTYPE html>
<html lang="en" dir="ltr" class="no-js">
<head>
<link href="/Content/responsive/VOA/en-US-LEARN/VOA-en-US-LEARN.css?&amp;av=0.1.0.0&amp;cb=144" rel="stylesheet"/>
<script src="//tags.tiqcdn.com/utag/bbg/voa-pangea/prod/utag.sync.js"></script> <script type='text/javascript' src='https://www.youtube.com/iframe_api'></script>
<script type="text/javascript">

....

We give this string to BeautifulSoup, which parses it and stores the result in an object called soup. In effect, we transfer the information into a special box designed for handling HTML.
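
elems = soup.find_all(href=re.compile("/a/"))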

soup.find_all() extracts the elements containing article links from the soup object.

In html, each link to an article is written as <a href='article URL'> </a>.
When you pass href=re.compile("/a/") to find_all(), only the <a href='URL'> </a> parts are extracted from the HTML, and among them only the elements whose URL contains the characters /a/ are stored in the list elems. This is because the URL of each article on the VOA Learning English site has the form https://learningenglish.voanews.com/a/....
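To see what this regular expression matches, here is a small standalone check using two paths that appear on this page:

import re
pattern = re.compile("/a/")
print(bool(pattern.search("/a/new-us-citizens-look-forward-to-voting/5538093.html")))  # True
print(bool(pattern.search("/z/3521")))  # False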

Since elems still contains the tags, we extract just the URLs from the list.

links = []
for i in range(len(elems)):
    links.append('https://learningenglish.voanews.com'+elems[i].attrs['href'])

links = [] creates links as an empty list.

The for statement repeats a process.

range() specifies the range of the iteration; for example, in for i in range(10):, i varies from 0 to 9. Here the loop runs len(elems) times, once for each element of elems. For example, if five addresses were extracted, the elements range from elems[0] to elems[4], and len(elems) = 5.

Next, elems[i].attrs['href'] extracts only the content of the href attribute from elems[i], which holds the whole tag.

For example,

elems[0] = '<a href="/a/new-us-citizens-look-forward-to-voting/5538093.html">'

elems[0].attrs['href'] = '/a/new-us-citizens-look-forward-to-voting/5538093.html'

These URLs do not include the domain name, so 'https://learningenglish.voanews.com' has to be added to the front.

links.append() adds an element to the end of the list (array) links. Here, the extracted URLs are stored one by one.
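As an aside, Python lets you write this kind of loop more compactly as a list comprehension; the following sketch builds exactly the same list:

# Equivalent to the for loop above
links = ['https://learningenglish.voanews.com' + elem.attrs['href'] for elem in elems]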

Run the code to check the contents of the links.

https://learningenglish.voanews.com/a/5526946.html
https://learningenglish.voanews.com/a/5526946.html
https://learningenglish.voanews.com/a/new-us-citizens-look-forward-to-voting/5538093.html
https://learningenglish.voanews.com/a/new-us-citizens-look-forward-to-voting/5538093.html
https://learningenglish.voanews.com/a/coronavirus-stops-starts-testing-europeans-patience/5543763.html
https://learningenglish.voanews.com/a/coronavirus-stops-starts-testing-europeans-patience/5543763.html

....

Duplicate URLs seem to have been extracted.

links = np.unique(links)

Here we use the numpy numerical library mentioned earlier. Because we imported it with import numpy as np, we write np instead of numpy.

unique() removes duplicate data (and returns the result in sorted order), so numpy lets us organize the data easily. Here, the deduplicated URLs are stored in links again.
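If you would rather not depend on numpy for this one step, the standard library gives the same result, since sorted(set(...)) also removes duplicates and sorts:

# Equivalent deduplication without numpy
links = sorted(set(links))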

text='\n'.join(links)

The strings stored in links are joined together with join(). Writing '\n'.join() inserts a newline character between the joined strings, so we end up with a string that has one URL per line.
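A quick illustration of how join() behaves:

print('\n'.join(['first', 'second']))
# first
# second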

with io.open('article-url.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Finally, the program writes the resulting string to the file article-url.txt.
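Incidentally, in Python 3 the built-in open() works the same way as io.open(), so this step could equally be written without the io module:

with open('article-url.txt', 'w', encoding='utf-8') as f:
    f.write(text)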

When you open the newly created text file, you will see

https://learningenglish.voanews.com/a/5543923.html
https://learningenglish.voanews.com/a/after-multiple-crises-this-time-lebanese-feel-broken-/5542477.html
https://learningenglish.voanews.com/a/coronavirus-stops-starts-testing-europeans-patience/5543763.html

....

We can confirm that the URLs of the articles have been extracted.

Here is the whole code.

import numpy as np
import requests
from bs4 import BeautifulSoup
import io
import re

# Category page that lists the articles
url = 'https://learningenglish.voanews.com/z/3521'
res = requests.get(url)

# Parse the HTML
soup = BeautifulSoup(res.text, 'html.parser')

# Collect all link elements whose href contains /a/
elems = soup.find_all(href=re.compile("/a/"))

# Build full URLs by prepending the domain name
links = []
for i in range(len(elems)):
    links.append('https://learningenglish.voanews.com'+elems[i].attrs['href'])

# Remove duplicates, join with newlines, and save to a file
links = np.unique(links)
text = '\n'.join(links)
with io.open('article-url.txt', 'w', encoding='utf-8') as f:
    f.write(text)