In this example, I followed a two-part process to get the lyrics of the most popular songs by the top 10 artists:

I used rvest to extract the Top 10 Pop Artists of All Time from billboard.com. Then I used these artists’ names to extract their popular songs and lyrics from genius.com.

Follow along for a step-by-step walk-through.

First, of course, you will need to install and load the following packages.

library(tidyverse)

library(rvest)

PART 1 : Extracting the Top 10 Pop Artists of All Time — Source: www.billboard.com

Identify the URL from which you want to extract data. Then use the read_html() function to create an HTML document from that URL.
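As a minimal sketch of this step, the snippet below parses a small literal HTML string instead of the live page, so it runs offline. The markup and the artist-name class are made up for illustration; with the real page you would pass its URL as a string to read_html() in the same way.

```r
library(rvest)

# Hypothetical markup standing in for the chart page (not Billboard's real HTML).
sample_html <- '<html><head><title>Top Artists</title></head><body>
  <span class="artist-name">Artist A</span>
  <span class="artist-name">Artist B</span>
</body></html>'

# read_html() accepts a URL, a file path, or literal HTML like this string.
page <- read_html(sample_html)

# The result is a parsed document that the selector functions can query.
class(page)
```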

Next you need to identify the CSS selector which points to the data you want to extract. It helps to have a little knowledge of HTML and CSS. Otherwise, you can use the handy Chrome extension SelectorGadget to find the CSS selector. The easiest way I found is to right-click on any page element in Chrome and select Inspect Element.

You can then use the html_nodes() function with the CSS selector to extract the data you want.
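Here is a rough sketch of the selection step, again on made-up markup so that it runs without a network connection. The .artist-name selector is an assumption for illustration only; the real selector is whatever SelectorGadget or the inspector shows for the page you are scraping. In recent rvest releases html_elements() is the preferred name for the same operation, but html_nodes() still works.

```r
library(rvest)

# Hypothetical markup; the class name "artist-name" is invented for this sketch.
sample_html <- '<html><body>
  <ol>
    <li><span class="artist-name"> Artist A </span></li>
    <li><span class="artist-name">Artist B</span></li>
  </ol>
</body></html>'

page <- read_html(sample_html)

# Select every node that matches the CSS selector.
artist_nodes <- html_nodes(page, ".artist-name")

# Pull the text out of each node and trim stray whitespace.
artists <- trimws(html_text(artist_nodes))
artists
```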

Then all you need is to save your results into a data frame. I’ve used tibbles here to store the data, as they are a little easier to work with than data frames. You can read more about the difference between data frames and tibbles here.
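Storing the results could look something like the sketch below. The artist names are placeholders rather than real chart results, and the column names rank and artist are my own choice for illustration. The tibble package is loaded as part of the tidyverse; it is loaded on its own here to keep the example small.

```r
library(tibble)

# Placeholder names standing in for the scraped artist names.
artists <- c("Artist A", "Artist B", "Artist C")

# One row per artist, with the position in the list kept as the rank.
top_artists <- tibble(rank = seq_along(artists), artist = artists)
top_artists
```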

PART 2 : Extracting Popular Songs and Lyrics of the top 10 Artists — Source: www.genius.com

Now that you have the Top 10 Pop Artists, you can use the genius.com website to identify the most popular songs and extract their lyrics.

First, identify the URL of each artist’s webpage. A little research on the website showed that all the artist pages follow this format: https://genius.com/artists/<artistname>
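Building the URLs from the artist names can be sketched as below. The names are placeholders, and the rule that spaces turn into hyphens is an assumption on my part; check a real artist page to confirm how the name is written in the URL before relying on it.

```r
# Placeholder names standing in for the Part 1 results.
artists <- c("Artist One", "Another Artist")

# Assumption: spaces in the name become hyphens in the URL.
slugs <- gsub(" ", "-", artists, fixed = TRUE)

# Append each slug to the base format shown above.
artist_urls <- paste0("https://genius.com/artists/", slugs)
artist_urls
```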

Then use a nested for loop to extract the songs and their lyrics: the outer loop goes through the artists and the inner loop goes through each artist’s songs. Here again, I used SelectorGadget to identify the right CSS selector for the job.

Next, store the results in a tibble.

And finally, don’t forget to include a random sleep interval between loop iterations so that the website doesn’t block you for sending requests too quickly.
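Putting the Part 2 steps together, a minimal offline sketch might look like this. Everything site-specific here is hypothetical: the pages list stands in for the live website, the a.song and .lyrics selectors are invented, and fetch_page() is a helper name I made up. On a live site the helper would call read_html() on a real URL, and the pauses should be longer than the fraction of a second used here.

```r
library(rvest)
library(tibble)

# Hypothetical offline pages keyed by a made-up path (not real Genius markup).
pages <- list(
  "artists/Artist-A" = '<html><body><a class="song" href="songs/a1">Song A1</a>
                        <a class="song" href="songs/a2">Song A2</a></body></html>',
  "artists/Artist-B" = '<html><body><a class="song" href="songs/b1">Song B1</a></body></html>',
  "songs/a1" = '<html><body><div class="lyrics">Lyrics for A1</div></body></html>',
  "songs/a2" = '<html><body><div class="lyrics">Lyrics for A2</div></body></html>',
  "songs/b1" = '<html><body><div class="lyrics">Lyrics for B1</div></body></html>'
)

# Hypothetical helper: pause for a random interval, then parse the page.
fetch_page <- function(key) {
  Sys.sleep(runif(1, min = 0.1, max = 0.5))  # use longer pauses on a live site
  read_html(pages[[key]])
}

artists <- c("Artist-A", "Artist-B")
artist_lyrics <- tibble(artist = character(), song = character(), lyrics = character())

for (artist_name in artists) {                 # outer loop: one artist page each
  artist_page <- fetch_page(paste0("artists/", artist_name))
  song_links <- html_nodes(artist_page, "a.song")
  song_titles <- html_text(song_links)
  song_keys <- html_attr(song_links, "href")

  for (i in seq_along(song_keys)) {            # inner loop: one song page each
    song_page <- fetch_page(song_keys[i])
    song_lyrics <- paste(html_text(html_nodes(song_page, ".lyrics")), collapse = "\n")
    artist_lyrics <- add_row(artist_lyrics, artist = artist_name,
                             song = song_titles[i], lyrics = song_lyrics)
  }
}

artist_lyrics
```

Growing the tibble with add_row() inside the loop keeps the sketch easy to follow; for a large scrape it would be more efficient to collect the rows in a list and bind them once at the end.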

And here’s a snapshot of what your dataset will look like.

Snapshot of the artist_lyrics tibble

That’s all you need to know to create your own dataset, which gives you endless possibilities to experiment with the data you want.

Hope you enjoyed this tutorial and are now inspired to create your very own dataset.

Thanks for reading!

Follow me on Instagram for my weekly learning progress and the study resources I use.