To access the CryptoCompare public API in Python, we can use the following Python wrapper available on GitHub: cryCompare.

%load_ext autoreload
%autoreload 2

import operator

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy  # used below by fcluster; imported explicitly rather than via the star imports
from joblib import Parallel, delayed

from crycompare import *
from ClusterLib.clusterlib import *
from ClusterLib.distlib import *

%matplotlib inline

With the coinList() function we can fetch all the available cryptocurrencies (about 1450).

p = Price()
coinList = p.coinList()
coins = sorted(coinList['Data'].keys())

With the histoDay() function we can fetch the historical data (OHLC prices and volumes) for a given pair. We keep only coins which have a non-trivial history (about 1350).

h = History()
df_dict = {}
for coin in coins:
    histo = h.histoDay(coin, 'USD', allData=True)
    if histo['Data']:  # skip coins with an empty history
        df_histo = pd.DataFrame(histo['Data'])
        df_histo['time'] = pd.to_datetime(df_histo['time'], unit='s')
        df_histo.index = df_histo['time']
        # keep only the OHLC columns
        del df_histo['time']
        del df_histo['volumefrom']
        del df_histo['volumeto']
        df_dict[coin] = df_histo
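To make the per-coin transformation concrete, here is a minimal sketch that applies the same steps to a small hand-written payload. The payload is hypothetical: its values are invented and it only mimics the field names used in the loop above (`time`, the four OHLC fields, `volumefrom`, `volumeto`); the real response may carry more fields.

```python
import pandas as pd

# Hypothetical payload shaped like histo['Data']; all values are made up.
mock_data = [
    {'time': 1503273600, 'close': 2.0, 'high': 2.5, 'low': 1.5, 'open': 1.8,
     'volumefrom': 10.0, 'volumeto': 20.0},
    {'time': 1503360000, 'close': 2.2, 'high': 2.4, 'low': 1.9, 'open': 2.0,
     'volumefrom': 12.0, 'volumeto': 26.0},
]

df_mock = (
    pd.DataFrame(mock_data)
    # UNIX timestamps (seconds) become datetimes
    .assign(time=lambda d: pd.to_datetime(d['time'], unit='s'))
    # the datetime column becomes the index
    .set_index('time')
    # only the OHLC columns are kept
    .drop(columns=['volumefrom', 'volumeto'])
)
print(df_mock)
```

The method chain is only a stylistic alternative to the `del` statements; both leave a frame indexed by day with the four OHLC columns.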

We store all info in a dataframe with 2-level columns: the first level contains the coin names, the second one, the OHLC prices.

crypto_histo = pd.concat(df_dict.values(), axis=1, keys=df_dict.keys())

histo_coins = [elem for elem in crypto_histo.columns.levels[0] if elem != 'MYC']
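As an illustration of the 2-level column layout, the sketch below builds the same kind of frame from two toy coins. The coin names `AAA` and `BBB` and their prices are hypothetical.

```python
import pandas as pd

idx = pd.to_datetime(['2017-08-24', '2017-08-25'])

# Toy per-coin frames; names and values are hypothetical.
toy = {
    'AAA': pd.DataFrame({'close': [1.0, 1.1], 'open': [0.9, 1.0]}, index=idx),
    'BBB': pd.DataFrame({'close': [5.0, 4.0], 'open': [5.5, 5.0]}, index=idx),
}

# keys= adds the coin name as the first column level
toy_histo = pd.concat(list(toy.values()), axis=1, keys=list(toy.keys()))

print(toy_histo.columns.tolist())              # (coin, field) tuples
print(toy_histo['BBB']['close'])               # one field of one coin
print(toy_histo.xs('close', axis=1, level=1))  # the same field for every coin
```

Selecting on the first level returns all the OHLC columns of one coin, while `xs` on the second level returns one price field across all coins.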

Since many coins are quite recent, many have relatively short time series of historical data. We sort them by the decreasing length of their time series. BTC (Bitcoin) has the longest one, as expected.

histo_length = {}
for coin in histo_coins:
    # number of days with an observed 'close' price
    histo_length[coin] = np.sum(~np.isnan(crypto_histo[coin]['close'].values))
sorted_length = sorted(histo_length.items(), key=operator.itemgetter(1), reverse=True)
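The same ranking can also be obtained with pandas alone. The sketch below is an alternative formulation on toy data (the coins `AAA` and `BBB` are hypothetical), not the code used for the results in this post.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2017-08-21', periods=4, freq='D')
cols = pd.MultiIndex.from_product([['AAA', 'BBB'], ['close', 'open']])

# Toy history: AAA only starts trading on the third day.
toy_histo = pd.DataFrame(
    [[np.nan, np.nan, 1.0, 1.0],
     [np.nan, np.nan, 1.1, 1.0],
     [2.0, 2.0, 1.2, 1.1],
     [2.1, 2.0, 1.3, 1.2]],
    index=idx, columns=cols)

# One 'close' column per coin, then the number of non-NaN days per coin
close = toy_histo.xs('close', axis=1, level=1)
lengths = close.count().sort_values(ascending=False)
print(lengths)
```

`count()` ignores NaN values, so it plays the role of `np.sum(~np.isnan(...))` in the loop above.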

For the following, we will only consider the 300 longest time series.

# we keep the 300 coins having the longest time series of historical prices
sub_coins = [sorted_length[i][0] for i in range(300)]
sub_crypto_histo = crypto_histo[sub_coins]
sub_crypto_histo.tail()

                POT                                 TEK                                 BTC               ...      DGD             PIGGY                                DIEM
              close     high      low     open    close     high      low     open    close     high      ...      low     open    close     high      low     open    close     high      low     open
time
2017-08-21   0.1282   0.1368   0.1265   0.1321 0.000120 0.000451 0.000079 0.000122  4005.10  4097.25      ...    74.14    78.93 0.000681 0.000696 0.000555 0.000610 0.000040 0.000041 0.000040 0.000041
2017-08-22   0.1267   0.1411   0.1089   0.1282 0.000123 0.000166 0.000108 0.000120  4089.70  4142.68      ...    67.84    78.34 0.000614 0.000704 0.000397 0.000681 0.000041 0.000041 0.000036 0.000040
2017-08-23   0.1245   0.1335   0.1191   0.1267 0.000124 0.000170 0.000081 0.000123  4141.09  4255.62      ...    76.57    78.85 0.000621 0.000809 0.000570 0.000614 0.000041 0.000043 0.000041 0.000041
2017-08-24   0.1289   0.1324   0.1211   0.1245 0.000130 0.000131 0.000082 0.000124  4318.35  4364.11      ...    80.01    89.70 0.000648 0.000698 0.000576 0.000621 0.000043 0.000044 0.000041 0.000041
2017-08-25   0.1335   0.1576   0.1266   0.1289 0.000133 0.000179 0.000086 0.000130  4442.46  4461.71      ...    90.69    93.97 0.000622 0.000669 0.000602 0.000648 0.000044 0.000045 0.000043 0.000043

5 rows × 1200 columns

All these 300 time series have at least 1000 days of observed prices. We will only consider the most recent 1000 days for the correlation study.

N = len(sub_coins)
recent_histo = sub_crypto_histo[-1000:]

Below, we compute their daily log-returns.

returns_dict = {}
for coin in sub_coins:
    coin_histo = recent_histo[coin]
    # daily log-returns of the four OHLC columns
    # (.values replaces get_values(), which was removed in pandas 1.0)
    coin_returns = pd.DataFrame(np.diff(np.log(coin_histo.values), axis=0))
    returns_dict[coin] = coin_returns
recent_returns = pd.concat(returns_dict.values(), axis=1, keys=returns_dict.keys())
recent_returns.index = recent_histo.index[1:]

recent_returns = recent_returns.replace([np.inf, -np.inf], np.nan)
recent_returns = recent_returns.fillna(value=0)

recent_returns.isnull().values.any()

False
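To see why the infinite values appear and what the cleaning step does to them, here is a small sketch on a toy price series (the prices are invented). It assumes that a zero price is what produces the infinities; the real data may contain other causes.

```python
import numpy as np
import pandas as pd

# Toy 'close' prices with one zero quote on the third day.
prices = pd.DataFrame(
    {'close': [100.0, 110.0, 0.0, 121.0]},
    index=pd.date_range('2017-08-21', periods=4, freq='D'))

# log(0) is -inf, so the returns around the zero quote are -inf and +inf
with np.errstate(divide='ignore', invalid='ignore'):
    log_returns = np.log(prices).diff().iloc[1:]

# Same cleaning as above: infinities become NaN, then NaN becomes 0
clean = log_returns.replace([np.inf, -np.inf], np.nan).fillna(0.0)
print(clean)
```

The first return is log(110/100); the two returns that touch the zero price are set to 0, which treats those days as if the price had not moved.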

plt.figure(figsize=(40, 10))
for coin in sub_coins:
    plt.plot(recent_returns[coin])
# plt.legend(sub_coins, loc='upper left')
plt.xlabel('time', fontsize=18)
plt.ylabel("returns 'X/USD'", fontsize=18)
plt.show()

Notice that the scale of the plotted returns is huge compared to other financial assets (which are usually contained in a (-0.15, 0.15) range, with some tails valued at ~2 or 3).

Now, we compute a correlation/distance matrix between all these coins. Notice that we consider here the OHLC representation, and thus we have to compute a correlation between random vectors, and not random variables (which is usually done by considering only the ‘close’ price for example). The distance correlation is a relevant measure of statistical dependence for that purpose. We apply it between the 300x299/2 = 44850 pairs in parallel using the joblib library.
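For readers unfamiliar with the distance correlation, the sketch below implements the standard sample definition for two sets of paired vector observations. It is an illustration only: the function name `distance_correlation` is ours, and the `distcorr` function from `ClusterLib.distlib` used in this post may differ in its details.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def distance_correlation(x, y):
    """Sample distance correlation between paired observations x and y.

    x and y have shape (n,) or (n, d): one row per observation
    (here, one row per day and one column per OHLC return).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n = x.shape[0]
    if y.shape[0] != n:
        raise ValueError('x and y must have the same number of observations')

    # Pairwise Euclidean distances between observations
    a = squareform(pdist(x))
    b = squareform(pdist(y))

    # Double centering: subtract row and column means, add the grand mean
    A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()

    dcov2_xy = max((A * B).sum() / n ** 2, 0.0)  # clamp rounding noise
    dcov2_xx = (A * A).sum() / n ** 2
    dcov2_yy = (B * B).sum() / n ** 2

    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0:
        return 0.0
    return float(np.sqrt(dcov2_xy / denom))


rng = np.random.RandomState(0)
x = rng.randn(50, 4)  # 50 "days" of 4-dimensional returns
print(distance_correlation(x, 2 * x + 1))          # exact linear dependence
print(distance_correlation(x, rng.randn(50, 4)))   # independent draws
```

Because it works on pairwise distances between observations, the measure accepts vectors of any dimension, which is what lets us compare the full OHLC returns of two coins rather than a single price column.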

dist_mat = np.zeros((N, N))
a, b = np.triu_indices(N, k=1)  # indices of the strict upper triangle
dist_mat[a, b] = Parallel(n_jobs=-2, verbose=1)(
    delayed(distcorr)(recent_returns[sub_coins[a[i]]], recent_returns[sub_coins[b[i]]])
    for i in range(len(a)))
dist_mat[b, a] = dist_mat[a, b]  # mirror into the lower triangle

[Parallel(n_jobs=-2)]: Done 39186 tasks | elapsed: 19.3min
[Parallel(n_jobs=-2)]: Done 42036 tasks | elapsed: 20.7min
[Parallel(n_jobs=-2)]: Done 44850 out of 44850 | elapsed: 22.1min finished
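The indexing pattern used to fill the matrix can be checked on a small case. In this sketch the pairwise score is a stand-in (`|i - j|`), chosen only so that the result is easy to read; it is not a dependence measure.

```python
import numpy as np

n = 4
a, b = np.triu_indices(n, k=1)   # all pairs (i, j) with i < j
print(len(a))                    # n * (n - 1) / 2 pairs

# Stand-in symmetric score for each pair (hypothetical)
values = np.abs(a - b).astype(float)

mat = np.zeros((n, n))
mat[a, b] = values               # fill the strict upper triangle
mat[b, a] = mat[a, b]            # mirror it into the lower triangle
print(mat)

# Pair count for the 300 coins of this study
print(300 * 299 // 2)
```

Only the upper triangle is computed, which halves the work for a symmetric measure; the diagonal is left at zero.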

Then, using the dendrogram obtained from the Ward hierarchical clustering method, we can sort the coins so that their correlation/distance matrix is more readable.

seriated_dist_mat, res_order, res_linkage = compute_serial_matrix(dist_mat)
ordered_coins = [sub_coins[res_order[i]] for i in range(len(res_order))]
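The idea behind this reordering can be sketched with SciPy alone. The function `seriate` below is a hypothetical stand-in that returns the same three kinds of objects; the actual `compute_serial_matrix` from ClusterLib is not shown in this post and may proceed differently.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform


def seriate(dist):
    """Reorder a symmetric matrix by the leaf order of a Ward dendrogram."""
    dist = np.asarray(dist, dtype=float)
    condensed = squareform(dist, checks=False)  # upper triangle as a vector
    link = linkage(condensed, method='ward')
    order = leaves_list(link)                   # left-to-right leaf order
    return dist[np.ix_(order, order)], order, link


# Toy matrix with two obvious groups, {0, 2} and {1, 3}
toy_dist = np.array([
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.8, 0.1],
    [0.1, 0.8, 0.0, 0.9],
    [0.8, 0.1, 0.9, 0.0]])

seriated, order, link = seriate(toy_dist)
print(order)
print(seriated)
```

After reordering, members of the same group sit next to each other, so the small values form blocks along the diagonal. Note that Ward linkage formally assumes Euclidean distances, so applying it to another dissimilarity is a heuristic.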

plt.figure(figsize=(80, 80))
plt.pcolormesh(seriated_dist_mat)
# plt.colorbar()
plt.xlim([0, N])
plt.ylim([0, N])
plt.xticks(np.arange(N) + 0.5, ordered_coins, rotation=90, fontsize=25)
plt.yticks(np.arange(N) + 0.5, ordered_coins, fontsize=25)
plt.show()

We can observe that some coins cluster together: their patterns of correlation (and lack of correlation) with the rest of the coins are similar.

For example, we obtain the following clusters (if we ask for 30 groups).

nb_clusters = 30
cluster_map = pd.DataFrame(
    scipy.cluster.hierarchy.fcluster(res_linkage, nb_clusters, 'maxclust'),
    index=ordered_coins)
clusters = []
k = 0
for i in range(0, nb_clusters):
    # coins whose cluster label is i + 1 (labels start at 1)
    compo = cluster_map[cluster_map[0] == (i + 1)].index.values
    clusters.append(compo)
    k = k + len(compo)

for cluster in clusters:
    print(cluster)

['GEMZ' 'MINT' 'NMC']
['BTCD' 'MIL' 'SFR' 'TRK' 'FC2' 'AUR' 'ALN' 'LKY' 'CLOAK' 'URO' 'YBC' 'XPM' 'EMD']
['PIGGY' 'PEN' 'C2' 'DMD' 'POINTS' 'XLB' 'RMS' 'AC' 'VDO' 'AXR' 'XSI']
['LXC' 'XMY' 'PSEUD' 'GUE' 'LK7' 'SAT2']
['UFO' 'HZ' 'NSR' 'UNO' 'SLG' 'GRS' 'AMC' 'SPA' 'MRY']
['ACOIN' 'RPC' 'DSB' 'UIS']
['XMR' 'NRS' 'OPAL']
['DGC' 'CACH' 'XXX' 'DGD']
['OMNI' 'QTL' 'NODE' 'BLK' 'LTB' 'ULTC']
['LTC' 'MEC']
['HUC' 'CANN' 'NBT']
['TAG']
['EMC2' 'CRAIG' 'CSC' '42' 'HBN' 'CAP' 'EAGS' 'GML']
['FLT' 'TEK' 'GB' 'SXC' 'BLU' 'TOR' 'KDC' 'GLC' 'CYC']
['NAUT' 'XCP' 'EXE']
['ZNY' 'BITS' 'SAR' 'ANC' 'GLX' 'PYC' 'OSC']
['SYS' 'BURST' 'HYP' 'PTS' 'OK' 'EXCL' 'IFC' 'KEY' 'XBOT' 'VTX' 'SUPER']
['NOBL' 'EFL' 'CESC' 'BSTY' 'CASH' 'USDE' 'RED']
['CRW' 'GLYPH' 'NRB' 'DEM' 'LSD' 'FRC' 'PXC' 'ARG' 'UTIL']
['VTC' 'XST' 'FLDC' 'BAY' 'BELA' 'FIBRE' 'CKC' 'GAP' 'BTS' 'LTS' 'CCN' 'SMLY' 'BTQ' 'SCOT' 'START' 'MZC' 'ORB' 'PXI' 'LOG' 'IOC' 'CRYPT' 'XC' 'BTM' 'XCN' 'LYC' 'DVC' 'CLR' 'MED' 'FRAC' 'STR*' 'DT' 'NMB' 'TGC' 'NAN' 'CNL' 'AGS' 'SHADE' 'CRACK' 'LTCX' 'HVC' 'EZC' 'LTCD' 'NXTI' 'BUK' 'JKC' 'ZCC' 'COMM' 'DRKC' 'CON' 'MOTO' 'SDC' 'FRK' 'XPY' 'CIN' 'CAM' 'CMC' 'ELC' 'XBS' 'FFC' 'TTC' 'SPT' 'KGC' 'XJO' 'IXC' 'SMC' 'XG' 'DANK' 'GHS' 'IPC' 'KARM' 'CRC' 'MARS' 'CARBON']
['SILK' 'WC' 'MAX']
['NAV' 'CLAM' 'MSC' 'APEX' 'BQC' 'MMC' 'SONG' 'PRC' 'CNC' 'VOOT' 'MNC' 'SSV' 'XCASH' 'SOLE' 'NBL' 'NVC' 'PPC' 'DOGE' 'SRC' 'TRC' 'OBS']
['XWT' 'FTC' 'NOTE' 'GLD' 'XCR' 'XRP' 'USDT' 'DIEM' 'WDC' 'MARYJ' 'BOST' 'HAL' 'NYAN']
['VRC' 'VIA' 'DGB' 'DASH' 'BBR' 'SLM']
['XMG' 'SSD' 'JBS' 'GP' 'MAD' 'J' 'SWIFT' 'NEC']
['DOGED' 'MLS' 'SYNC' 'TIT']
['GRC' 'XTC' 'EAC' 'CINNI' 'RYC' 'COOL' 'MN']
['YAC' '2015' 'UNITY' 'ARCH' 'NXTTY' 'NET' 'SPR' 'UTC' 'BLOCK' 'NXT' 'RDD' 'ICB' 'SHLD' 'RZR' 'BCX' 'BTE' 'ALF' 'BEN' 'GDC' 'AERO' 'RT2' 'LAB' 'MIN' 'RIPO' 'JUDGE' 'ZED' 'TES' 'BTC' 'RICE']
['LGD' 'TAK' 'CCC' 'MNE']
['WOLF' 'QRK' 'ZET' 'SBC' 'POT' 'UNB' 'FST' 'PHS' 'BTB' 'BTG' 'OMA' 'GIVE' 'NKA']
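The way `fcluster` cuts a dendrogram into a requested number of groups can be sketched on a toy example. The names `AAA` to `DDD` and the distances are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = np.array(['AAA', 'BBB', 'CCC', 'DDD'])  # hypothetical coins

# Toy distances: AAA is close to CCC, BBB is close to DDD
toy_dist = np.array([
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.8, 0.1],
    [0.1, 0.8, 0.0, 0.9],
    [0.8, 0.1, 0.9, 0.0]])

link = linkage(squareform(toy_dist, checks=False), method='ward')

# 'maxclust' cuts the tree so that at most t flat clusters are formed
labels = fcluster(link, t=2, criterion='maxclust')

# labels[i] belongs to the i-th row of the matrix given to linkage
groups = [names[labels == k] for k in np.unique(labels)]
for group in groups:
    print(group)
```

One point worth keeping in mind: `fcluster` returns labels in the order of the observations that were passed to `linkage`, not in dendrogram leaf order. Whether the labels should be paired with `ordered_coins` or with `sub_coins` therefore depends on which matrix `compute_serial_matrix` used to build `res_linkage`, which is not shown here.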

We can use these clusters to average the correlation-distance values within each cluster and between each pair of clusters. We obtain the following filtered matrix:

display_filtered_distances_using_clusters(seriated_dist_mat, clusters)
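The averaging itself can be sketched as follows. The function `filter_by_clusters` is a hypothetical illustration, not the ClusterLib routine called above; in particular, excluding the diagonal from the within-cluster averages is our assumption.

```python
import numpy as np


def filter_by_clusters(dist, labels):
    """Replace each block of a matrix by its mean, blocks being cluster pairs."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    filtered = np.empty_like(dist)
    for p in np.unique(labels):
        rows = labels == p
        for q in np.unique(labels):
            cols = labels == q
            block = dist[np.ix_(rows, cols)]
            if p == q:
                # within a cluster: average the off-diagonal entries only
                n = block.shape[0]
                mean = block[~np.eye(n, dtype=bool)].mean() if n > 1 else 0.0
            else:
                # between two clusters: average the whole block
                mean = block.mean()
            filtered[np.ix_(rows, cols)] = mean
    return filtered


toy_dist = np.array([
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.8, 0.1],
    [0.1, 0.8, 0.0, 0.9],
    [0.8, 0.1, 0.9, 0.0]])
toy_labels = np.array([1, 2, 1, 2])  # items 0 and 2 together, 1 and 3 together

filtered = filter_by_clusters(toy_dist, toy_labels)
print(filtered)
```

Every entry is replaced by the mean of the block it belongs to, which removes the pair-level noise and leaves only the cluster-level structure.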

For example, below are the ‘close’ log-prices of one rather strong cluster: