Hello fellow R enthusiasts! As I’m sure you’ve heard, cryptocurrencies have been all the rage lately. Like all other asset classes, any crypto investment decision you make should include at least some degree of quantitative analysis. There are plenty of solid tools with which to do so. Live Coin Watch lets you set all kinds of useful filters and custom layouts. CoinTracker allows you to track portfolios, and helps you calculate your cost basis (no easy feat 😬). The "View All" tab of coinmarketcap.com lists out price information for almost every one of the 1500 various crypto coins, tokens, and assets* in existence (*don’t worry if you’re confused by the terminology, it’s still very much in flux). Great stuff all around.

Regardless of how good the UIs are on such sites though, sometimes it's just better to copy/paste into a spreadsheet and grind out whatever analysis you choose to pursue. And sure, we can always manually click-drag-copy-paste-edit when we want such information, but why do things manually when you can automate?

Enter our beloved R: with a few packages and a couple lines of code, we can rapidly tee up cryptocurrency price data in a clean, simple csv file. As discussed below, we'll use the “rvest” package to web scrape price data from coinmarketcap, and then use text matching/replacement and other base R functions to calculate and build a series of custom columns used for additional analysis.

Before we do all that though, a few quick notes:

For the code novices or beginners amongst us, don’t worry, this article assumes *minimal* prior knowledge or experience with R, or code in general. If you’ve never written a line of code in your life, I’d in fact strongly encourage you to keep reading! Beyond an easy setup, all you’ll have to do is copy, paste, and execute the code we discuss (4 keystrokes MAX). Simple stuff!!

For the more advanced folks reading this, yes, yes, I know what you’re thinking: “why not just use the API??” Sure, that's the way to go if you have the necessary skillsets in place, and for those that do, perhaps I'll eventually write a followup to this using jsonlite to tap into coinmarketcap directly. Short of that, however, I think web scraping is both easier to explain and also more accessible for the kind of audience that might be more inclined to using spreadsheets in the first place. Also, for those interested, I started writing my below code as part of an effort to gather historical data off the individual coin pages (another good topic for a potential followup article!). As it turns out though, cmc’s API documentation lists historical data as “Coming Soon,” so, if you're looking for that sort of thing, you're going to need to scrape anyways.

Lastly, everything I'm going to share with you is intended for personal research and other non-commercial use. If you need guidance on web scraping etiquette, here are a few good pointers. Otherwise, don't be a (code) jerk by running or applying anything below in a manner that would be disruptive to the fine folks at coinmarketcap dot com!

All that said, overview is as follows:

Setup

Read in the "View All" table

Clean up the name column

Clean out special characters

Create analysis columns

Export to a csv file

Final Thoughts

Let’s go!

Setup

As I’ve referenced in my previous articles, the quickest way to get started from scratch is to skim this primer on R, and then download RStudio. Think of RStudio as your one-stop-shop to write code (in the text editor module) and then run it (in the included console).

Once in R, you’ll need to install rvest, which includes all the functions we’ll need to read-in data contained in a website’s html tables:

install.packages( "rvest" ) library( "rvest" )

If you’re installing it via RStudio, it should already point you towards a CRAN mirror (repositories where you can download packages). In case not, remove the “#” from the chooseCRANmirror(0) call, type the number for one of the existing mirrors (I usually pick ‘USA (IA)’) into your Console/Terminal, hit enter, and then run the rest of the code.

Read in the table

Time to rock & roll! Execute the below rvest functions to grab the coinmarketcap "View All" table:

cmc_url <- "https://coinmarketcap.com/all/views/all/" cmc <- cmc_url %>% read _html() cmc <- cmc %>% html_nodes(xpath = '//*[@id="currencies-all"]' ) cmc <- cmc %>% html_table() %>% data.frame() cmc <- cmc[,-1]

Couple things to discuss here. First, in case you're unfamiliar with it, the %>% operator used here is from the magrittr package, which rvest automatically imports for you. In our code, %>% is being used as another means to nest each function. For instance, cmc_url %>% read_html() is the same thing as read_html(cmc_url). Either way, we'll use read_html() to pull in the raw html text of the "View All" page. The result isn't workable yet though, and contains much more than just our table. In order to refine it, we need to find the table's XPath, and then use it to further parse our html with html_nodes().

Finding the right XPath is a bit tricky. My approach was as follows:

While using Chrome to browse the "View All" page, right click around the table and select "Inspect" (Safari, FireFox, and most other browsers have a variant of this as well).

Remember that you're looking for a table. So, depending on where you clicked, you might need to expand a few <div> boxes until you find something like <table class="table">:

Once you find the right one, right click again and then navigate through to copy the XPath:

The one I found that works is //*[@id="currencies-all"], which we'll use as an argument to html_nodes(). After that, simply transform the table into an R dataframe, strip out the html table's index column (which we won't need), and then move on to the next step.

Clean up the name column

Now that we have everything in a nice, convenient R data frame, let's look at a few things in cmc:

str(cmc) head(cmc[,c(1,5)])

As you'll notice, all of the variables (columns) are character strings. If we want to do any kind of quantitative analysis on them, we'll need to transform some of the columns into the correct data-types, using functions like as.numeric().

Before that though, we need to use basic text replacement techniques to clean out all the line feeds and other special characters that many of them contain. Let's start with the Name column. When initially scraped, the cmc$Name comes out looking like this:

> cmc $Name [1:3] [1] "BTC

Bitcoin" "ETH

Ethereum" "XRP

Ripple"

The goal here will be to split up each string using strsplit(), grab the isolated name of each coin, and then splice it back into our original data frame:

name <- strsplit(cmc$Name, "

" ) name <- sapply(name, `[` , 2 ) name <- gsub( " " , "-" ,name) cmc$Name <- name

The split occurs based on the "

" line feed character, for which I have two points of caution: on the one hand, if you're running this on Windows, you might need to edit this to "\r

" (I'm actually not sure if this will matter, but in case it does, further reading on that topic here). On the other, I'd check it anyway, given that I've seen it occasionally include a few TAB spaces like this: "[ TAB ][ TAB ]

". Once you get the right character in place, the output will look as follows:

> name[1:3] [[1]] [1] "BTC" "Bitcoin" [[2]] [1] "ETH" "Ethereum" [[3]] [1] "XRP" "Ripple"

The output is a list of paired crypto symbols and names. Now, we already have a column in cmc with all the symbols, so we only have to focus on extracting the names in the 2nd element of each pair. We do that with the sapply() call on our temporary name vector, using the '[' bracket as an operator/ function. A bit confusing, yes, thankfully here's further reading on use of '[' and '[[' in the apply() family of functions. The result of that will be a nice, straight vector that we can swap directly back into the original cmc$Name.

Clean out special characters

Elsewhere in the initial data, you'll notice the "*" and "?" characters in the Circulating Supply column, each used respectively to represent coins that aren't mineable (a good quick read on what crypto mining is, if you're unfamiliar), or those who's supply isn't known for some reason. The "?" character is similarly used again in the Price & Market Cap columns. Ultimately, we'll have to strip those out in order to be able to sort properly. The information the special characters convey is very useful though. Before cleaning them out, we can use the grepl() function (more reading on the grep family of functions here) to find which rows had special characters, and then note that in separate Logical columns:

cmc $Not_Mineable <- grepl( "\\*" ,cmc $Circulating .Supply) cmc $Unk_Mkt_Cap <- grepl( "\\?" ,cmc $Market .Cap) cmc $Unk_Price <- grepl( "\\?" ,cmc $Price ) cmc $Unk_Supply <- grepl( "\\?" ,cmc $Circulating .Supply)

As you'll notice, we need to make liberal use of escape notation in order to use special characters as arguments to functions. Once done, the result is a set of quick TRUE/FALSE columns denoting coins that are either un-mineable or have unknown supplies, prices, or market caps, all which can help us better filter the list.

Almost done cleaning up. Now that you've captured the information conveyed by the special characters in separate columns, you're free to clean them out:

cmc $Market .Cap <- gsub( "\\?" , "" ,cmc $Market .Cap) cmc $Price <- gsub( "\\?" , "" ,cmc $Price ) cmc $Circulating .Supply <- gsub( "\\?|

|\\*" , "" ,cmc $Circulating .Supply)

As is visible in our code, we've used gsub() to replace "*","?", and the line feed characters with nothing (denoted by ""), effectively removing them from each column. The resulting figures will be in #,##0, $#,##0, and $#,##0.00 formats, all which will work perfectly well Excel filters and functions. In R, though, they'll still be recognized as character strings. Because we still need to make some calculations on each prior to exporting, we should take the additional steps of removing the "$" and "," characters, while also transforming the data into actual numbers. Rather than disrupt the "$" and "," aesthetic for use in the eventual csv (Excel will have no problem sorting columns with "$" and ","), I decided to place the results in separate columns:

cmc$Mkt_Cap_sort <- ( as .numeric(gsub( "\\$|," , "" ,cmc$Market.Cap))) cmc$Price_sort <- as .numeric(gsub( "\\$|," , "" ,cmc$Price)) cmc$Circ_Supply_sort <- as .numeric(gsub( "," , "" ,cmc$Circulating.Supply))

For each, you might get the "NAs introduced by coercion" warning, which is no problem, as each NA will correspond to one of the unknown "?" data points that we can't work with anyway.

Create analysis columns

Now that everything is all cleaned up, you're ready to create additional sort columns that can be used for analysis and general categorization. Of course, anything we do here at this point can easily be done in Excel, but again, why not automate?

To get started, let's look at price. While scanning the "View All" table, you'll notice that many coins are below 1¢ ( below 1₥ even). As an easy first pass, let's make a column to identify all those sub-cent coins:

cmc $Price_below_a_cent <- cmc $Price_sort < 0.01

A quick summary of our new vector yields the following console output, showing us that the majority of coins are priced above 1¢:

> summary(cmc$Price_below_a_cent) Mode FALSE TRUE NA 's logical 1181 350 2

Another thing you might try is to create some kind of normalized measure of value across coins. Normally with conventional investments, you'd probably just use market cap, or a similar calculation of Unit Price x Quantity of Total Units Outstanding. Sure, market cap mainly applies to equities, but the same quick P x Q figure can be used to compare value across many different asset classes, at least enough for a ballpark estimation. Despite such simplicity, crypto message boards are full of weird, rambling diatribes against using market cap as a means to just that. I suspect this is largely due to ignorance new investors might have of the fact that market "cap" is short for "capitalization", and is not indicative of any kind of "cap" (or ceiling) that would prevent one's preferred coin from going "moon". But I digress, and don't really care otherwise 🤠. If our crypto-pals don't want to directly use market cap, then fine, let's at least help them find some way to achieve quicker comparability:

cmc $Price_norm_by_Circ <- cmc $Mkt_Cap_sort / cmc $Circ_Supply_sort [1] cmc $Price_norm_by_Mkt_Cap <- cmc $Mkt_Cap_sort [1] / cmc $Circ_Supply_sort

Bitcoin is still the biggest crypto out there, not just in (gasp) market cap, but also in name recognition. At one point, it was practically a Band-Aid like synonym for cryptocurrencies. So we might as well use Bitcoin's circ and price information as the common reference point. In the above code we try to see what each coin's price would be if:

The coin had the same market cap, divided by Bitcoin's circulation, or Bitcoin's market cap divided by the coin's circulation.

Both are trivial and slightly nonsensical measurements, but still hopefully capable of offering additional cross-comparability.

Beyond price and market capitalization, there's also some interesting stuff we can do with circulating supply. When I started looking at cryptos, one of the first things I tried to do was establish a baseline understanding of how circulating supplies worked for different coins. Like many of you, I heard that Bitcoin itself has a max supply of 21 million, to which I had a few immediate follow-on questions: Is 21 million an industry standard of some sort (no)? Is 21 million circ a common figure (sort of)? What does the overall distribution of crypto circulating supplies look like?

To answer these questions, I originally resorted to copy/pasting circ data into a spreadsheet, where I then manually cleaned it all up while bucketing everything into rough tables & charts. Definitely took quite a bit of time!! Thankfully, here we can feed our clean cmc$Circ_Supply_sort column through a simple for loop to figure all that out in a matter of seconds:

for (i in 1 :length (cmc$Circ_Supply_sort)) { if ( !is.na(cmc$Circ_Supply_sort[i]) & cmc$Circ_Supply_sort[i] <= 1 e7 ) { cmc$Circ_Category[i] <- "1: Below 10M" } else if ( !is.na(cmc$Circ_Supply_sort[i]) & cmc$Circ_Supply_sort[i] > 1 e7 & cmc$Circ_Supply_sort[i] <= 1 e8 ) { cmc$Circ_Category[i] <- "2: 10M - 100M" } else if ( !is.na(cmc$Circ_Supply_sort[i]) & cmc$Circ_Supply_sort[i] > 1 e8 & cmc$Circ_Supply_sort[i] <= 1 e9 ) { cmc$Circ_Category[i] <- "3: 100M - 1B" } else if ( !is.na(cmc$Circ_Supply_sort[i]) & cmc$Circ_Supply_sort[i] > 1 e9 & cmc$Circ_Supply_sort[i] <= 1 e12 ) { cmc$Circ_Category[i] <- "4: 1B - 1T" } else if ( !is.na(cmc$Circ_Supply_sort[i]) & cmc$Circ_Supply_sort[i] > 1 e12 ) { cmc$Circ_Category[i] <- "5: Above 1T" } else { cmc$Circ_Category[i] <- "Unknown" } }

Now, I recognize that for loops can be a bit intimidating at first. They're certainly not my strong suite as a code hobbyist. If you're in that same boat, here's a good tutorial. Once you get comfortable with our code above though, you'll see that it's all pretty simple. All we're trying to do is determine two things: first, is the circ value in each row an NA, and if not, does it fall within a certain range? Depending on the outcome of each if statement, we then assign a value to that row in a new cmc$Circ_Category column. From there, we can easily table or even plot our results to get a good feel for the distribution of circ:

# Table and plot out the results! as.data.frame(table(cmc$Circ_Category)) bp < - barplot ( table ( cmc $ Circ_Category ), col = rgb(0,119/255,181/255), ylim = c(0,500), main = "Cryptos by Circulating Supply" , xlab = "Category" , ylab = "Count" ) text ( x = bp, y = table(cmc$Circ_Category), label = table(cmc$Circ_Category), pos = 3, cex = 0.8, col = "red" )

Again, none of our table or chart work in R is necessary to craft our csv (which is essentially ready at this point!), but I like doing so as another means to check results before exporting. A few notes on each, starting with the table:

> as .data.frame(table(cmc$Circ_Category)) Var1 Freq 1 1 : Below 10 M 373 2 2 : 10 M - 100 M 443 3 3 : 100 M - 1 B 246 4 4 : 1 B - 1 T 107 5 5 : Above 1 T 2 6 Unknown 374

As you can see, the Var1 column takes on the values of our cmc$Circ_Category vector. I named them with a "1: ...", "2: ...", etc. convention as an easy way to order them properly, because the default for table() is to sort alphabetically. They'll also serve as decent labels for a bar chart, either in Excel or R itself:

In my code for the above chart, you'll first see the bp line for the barchart itself, and then the text line for our fancy-pants red label numbers. Given that base-R plot formatting is a vortex of pain & frustration, I don't want to spend too much more time explaining either. However, I would point out that yes, those bars are indeed "LinkedIn Blue"(always check the brand guide!), and yes, it's easy enough to use custom colors in your charts once you learn how to convert rgb into R plot parlance (ahh! the vortex!! 😱).

Export to a csv file

Almost done! Review your data.frame one last time, as is prudent:

str(cmc)

The result will print out the in-order name, data type, and first few values of each column. Should look something like this:

> str(cmc) 'data.frame' : 1545 obs. of 20 variables: $ Name : chr "Bitcoin" "Ethereum" "Ripple" "Bitcoin-Cash" ... $ Symbol : chr "BTC" "ETH" "XRP" "BCH" ... $ Market.Cap : chr "$186,425,595,720" "$93,949,921,856" "$46,140,490,585" "$27,044,016,964" ... $ Price : chr "$11049.80" "$961.77" "$1.18" "$1593.30" ... $ Circulating.Supply : chr "16,871,400" "97,684,701" "39,009,215,838" "16,973,588" ... $ Volume. .24 h. : chr "$9,076,180,000" "$2,572,700,000" "$1,255,570,000" "$779,904,000" ... $ X. .1 h : chr "0.16%" "-0.18%" "0.33%" "-0.01%" ... $ X. .24 h : chr "4.69%" "-0.27%" "3.03%" "2.61%" ... $ X. .7 d : chr "34.61%" "17.31%" "20.31%" "29.80%" ... $ Not_Mineable : logi FALSE FALSE TRUE FALSE FALSE TRUE ... $ Unk_Mkt_Cap : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Unk_Price : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Unk_Supply : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Mkt_Cap_sort : num 1.86e+11 9.39e+10 4.61e+10 2.70e+10 1.26e+10 ... $ Price_sort : num 11049.8 961.77 1.18 1593.3 227.98 ... $ Circ_Supply_sort : num 1.69e+07 9.77e+07 3.90e+10 1.70e+07 5.53e+07 ... $ Price_below_a_cent : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Price_norm_by_Circ : num 11050 5569 2735 1603 747 ... $ Price_norm_by_Mkt_Cap: num 11049.8 1908.44 4.78 10983.28 3373.18 ... $ Circ_Category : chr "2: 10M - 100M" "2: 10M - 100M" "4: 1B - 1T" "2: 10M - 100M" ...

If yours looks radically different, now is your chance to go back and fix it. Once ready, set your working directory to somewhere easy to find, and export your csv:

setwd( "~/Downloads" ) write.csv(cmc, file = paste( "cmc_scrape_" ,Sys.Date(), ".csv" , sep= "" ))

In the file argument, I always like using paste() to combine a title with a date stamp via Sys.Date(). Also, if you're not doing this on a Mac, you might need to change the setwd() call to work with Windows or other operating systems. Otherwise, your result should be a nice, clean csv file, ready to help you navigate cryptocurrency mania!

Final Thoughts

As with previous articles, I've posted my full script to GitHub here. Feel free to take it and modify it to your heart's content (albeit while still practicing good web-scrape etiquette!

Also, I'd like to make a special shout-out to my pal Drew Elliot, a good friend and former roommate back during our USMC days. Drew now runs Coin Savage, a crypto investment blog & chart service. I wrote most of this article last week while on vacation, having approached the topic initially as a cool subject to explore in R. Drew was supremely helpful in giving me a crash course on all things crypto, both by phone and via his blog posts (I particularly enjoyed "Time To Build a Cryptocurrency Portfolio", and it's beginner-friendly explanation on why "Crypto is More Than Bitcoin"). Anyways, hats off to Drew, thanks again 👏👏👏!!

That's about all. Thanks as always for reading, let me know what you think!

-Matt