In this article, I will use Python to scrape a news article from a news website, summarize its story, and extract keywords about that particular article.

The website used in this example: https://www.channelnewsasia.com/

Link used in this example: https://www.channelnewsasia.com/news/singapore/mrt-north-south-line-train-delay-smrt-signalling-fault-12186480

I will be using a Jupyter notebook launched from Anaconda Navigator.

Begin by importing the following packages into the notebook. If you are missing a specific package, install it with pip from the Anaconda Prompt. Once the packages are loaded, we can begin our text extraction and summarization.
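A sketch of the import cell. The original code is not shown in the article, so the exact imports are an assumption; note that gensim's summarization module only exists in gensim versions before 4.0:

```python
import requests                # download the web page
from bs4 import BeautifulSoup  # parse the HTML

# gensim's summarization module was removed in gensim 4.0, so these
# imports need gensim < 4.0 (e.g. pip install "gensim==3.8.3").
try:
    from gensim.summarization import summarize, keywords
    gensim_available = True
except ImportError:
    gensim_available = False
```

The try/except guard simply makes the cell fail gracefully if a newer gensim is installed; with gensim < 4.0 the two functions import normally.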

We will use BeautifulSoup to extract content from the webpage's HTML tags. To find out which tag holds the headline or body, view the page source in Chrome by right-clicking and selecting View page source, or by pressing Ctrl + U on Windows.

In this example, suppose I want to extract the headline of the article. I would press Ctrl + F in the page source and search for specific keywords from the article to locate the relevant markup.

We can see that the article's title, or headline, has an h1 class. Once we know this, we can begin developing our code with the following commands.
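A sketch of the headline extraction. To keep the example self-contained, a small inline HTML snippet stands in for the downloaded page (the headline text below is a placeholder, not the real article's); on the live site you would pass `requests.get(url).text` to BeautifulSoup instead:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; on the live site you would use:
#   page = requests.get(url).text
page = """
<html><body>
  <h1 class="h1">North-South Line train delay: stand-in headline</h1>
  <p>Stand-in paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")
# find() returns the first matching tag; get_text(strip=True)
# pulls out its text without surrounding whitespace
headline = soup.find("h1").get_text(strip=True)
print(headline)
```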

Cross-check that the printed headline matches the one displayed on the website.

To find the tags for the main content of the article, follow the same steps as above and search for text from the article in the page source.

From the source code, we can tell that the content of the article sits inside paragraph, or p, tags. To extract the contents of all the p tags, we can use the following code. To remove the extra white space around each paragraph's text, use the strip function.
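A sketch of the paragraph extraction, again using a stand-in HTML snippet (with invented example sentences) in place of the real page:

```python
from bs4 import BeautifulSoup

# Stand-in article body; on the live page this would be the real markup.
page = """
<html><body>
  <p>  SINGAPORE: Train services were delayed on Monday morning.  </p>
  <p> The operator said the fault was resolved later in the day. </p>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")
# find_all("p") returns every paragraph tag;
# strip() removes the extra white space around each paragraph's text
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
print(paragraphs)
```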

We would then obtain the article text, but it still contains newline characters, '\n' (line breaks), as well as sentences that might not be important and do not end with periods. It might also include unnecessary information, such as the author's name or the publication date, which we are not interested in.

To solve this, we filter out the items that contain newline characters '\n' or do not contain periods '.'.

We then combine the remaining items into a single string with the join command.
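A sketch of the filtering and joining step, on a stand-in list of extracted paragraphs (the author name and date entries are hypothetical examples of the unwanted lines described above):

```python
# Stand-in list of extracted paragraph strings
paragraphs = [
    "SINGAPORE: Train services were delayed on Monday morning.",
    "Advertisement",                 # no period -> filtered out
    "By A Writer\n05 Jan 2020",      # contains '\n' -> filtered out
    "The operator said the fault was resolved later in the day.",
]

# Keep only items without line breaks that contain a period
sentences = [s for s in paragraphs if "\n" not in s and "." in s]

# Combine the remaining items into one string
article = " ".join(sentences)
print(article)
```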

Great! Our article now reads cleanly, without the extra line breaks or the unnecessary sentences that did not end with periods.

To summarize the article, we will use the summarize function from the gensim package we imported earlier, as in the following code. The ratio parameter controls how much text the summarizer outputs: it is the fraction of sentences in the original text to return. For example, a ratio of 0.5 means we want to retain 50% of the original sentences. The default ratio is 0.2, or 20%.
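A sketch of the summarization call, using invented stand-in article text so the example is self-contained, and guarded so it only runs when a pre-4.0 gensim is installed:

```python
# gensim.summarization exists only in gensim < 4.0
try:
    from gensim.summarization import summarize
except ImportError:
    summarize = None

# Stand-in article text with enough sentences for the summarizer
article = (
    "Train services on the North-South Line were delayed on Monday morning. "
    "Commuters reported waiting on crowded platforms for a long time. "
    "The operator said a signalling fault caused the delay. "
    "Free bus services were made available at affected stations. "
    "Engineers were deployed to repair the fault during the morning peak. "
    "Normal service resumed later in the morning, the operator said."
)

if summarize is not None:
    # ratio=0.5 keeps roughly half of the sentences (default is 0.2)
    summary = summarize(article, ratio=0.5)
    print(summary)
```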

Thanks to the gensim package, we were able to trim down the word count of the article and focus only on the essential text. To add a little more information to our article summary, we can print out details such as:

The length of the original article and the length of the summarized article (so we can compare the difference), the article headline, etc.
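A sketch of such a comparison printout. The strings below are stand-ins; in the notebook they would come from the earlier extraction and summarization steps:

```python
# Stand-in values; in the notebook these come from the earlier steps
headline = "North-South Line train delay: stand-in headline"
article = "Stand-in original article text with several sentences."
summary = "Stand-in summary text."

print("Headline:", headline)
print("Original article length:", len(article), "characters")
print("Summary length:", len(summary), "characters")
```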

Additionally, using the keywords function we imported from the gensim package at the beginning, we can obtain the keywords of the article. In this case, however, the keywords obtained do not provide much additional or useful context. The technique might prove more effective when analysing large text documents or books!
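A sketch of the keyword extraction on the same kind of stand-in text, with the same gensim version guard as before:

```python
# gensim.summarization exists only in gensim < 4.0
try:
    from gensim.summarization import keywords
except ImportError:
    keywords = None

# Stand-in article text
article = (
    "Train services on the North-South Line were delayed on Monday morning. "
    "The operator said a signalling fault caused the train delay. "
    "Engineers were deployed to repair the signalling fault."
)

if keywords is not None:
    # keywords() returns a newline-separated string of ranked keywords
    print(keywords(article))
```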

There you have it! A short guide showing how you can scrape news articles from websites, summarize them into a more concise version without compromising the content of the article, and generate keywords related to the body of the text.