A data-driven approach to select premium cryptocurrencies using social/developer data

Image credit: Deposit Photos

Similarity Matters

As the blockchain hype continues, a myriad of cryptocurrencies have been created, many of which are claimed to be the “next Bitcoin” or the “next Ethereum”. However, if a coin or token truly has that potential, its footprint on social media and in the development community should also be very similar to that of the premium coins. Charles Darwin once said: “A man’s friendships are one of the best measures of his worth.” The same principle works for cryptocurrencies.

How can we find the coins/tokens that are “close friends” of premium cryptocurrencies like Bitcoin and Ethereum? The general idea is to first collect data on different aspects of a large set of coins, especially their social media and development communities. Next, we use these data to create features that measure the similarity among the coins/tokens. Finally, the coins most similar to the premium coins are classified as promising, and the coins that are distant from them are classified as junk coins whose value may eventually slip to zero.

Quantify Coin-to-Coin Similarity

Data Collection

Cryptocurrencies have a large number of features, e.g. daily trading volume, daily volatility, number of Twitter followers, etc. In this research, we concentrate on their social and development aspects. We collect follower and interaction data from major social media platforms, including Twitter, Telegram, etc. We also keep track of developer community activity, such as the number of forks, stars, issues, etc. In total, we collect about twenty different features.

Defining Similarity

Our features are heterogeneous (a mixture of categorical and numerical data with different distributions) and differ in importance. We therefore cannot simply apply a conventional distance measure designed for a homogeneous linear space (e.g. Euclidean distance) and derive similarity from it without distorting the true relative importance of the features.

Therefore, we employ the random-forest embedding approach, which provides an effective method to map the original heterogeneous data to a very high-dimensional, sparse representation. This non-linear embedding preserves proximity in the sense that if two data points are very similar in the original feature space, they tend to fall into the same leaf node in each random tree, and hence have sparse binary vectors with a very short Hamming distance. As a result, we can simply use the inner product between two sparse vectors in this new high-dimensional space with homogeneous features as the measure of similarity.
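The embedding and inner-product similarity described above can be sketched as follows. This is a minimal illustration, not the exact pipeline used here: it assumes scikit-learn's RandomTreesEmbedding as the random-forest embedding, uses synthetic numerical features in place of the real data, and picks the number of trees and the tree depth arbitrarily. Categorical features would need to be encoded numerically first.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

# Synthetic stand-in for the real data: 200 hypothetical coins, 20 features.
rng = np.random.default_rng(0)
features = rng.lognormal(mean=0.0, sigma=2.0, size=(200, 20))

# Each tree sends a coin to exactly one leaf, and that leaf is one-hot encoded,
# so every row of `codes` contains exactly n_estimators ones.
n_trees = 100  # assumed value, chosen for illustration only
embedder = RandomTreesEmbedding(n_estimators=n_trees, max_depth=5, random_state=0)
codes = embedder.fit_transform(features)  # sparse matrix, shape (200, total number of leaves)

# The inner product of two binary codes counts the trees in which
# both coins land in the same leaf.
similarity = (codes @ codes.T).toarray()
```

Because every code contains exactly n_trees ones, a pair of coins with inner product s has Hamming distance 2 * (n_trees - s), so ranking by inner product and ranking by Hamming distance give the same order.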

Create a Similarity Matrix

In our example, we collect data on more than N = 4200 different cryptocurrencies and expand the 20-D feature space into an 11,000-D space. Taking the inner product between every pair of embedded coins yields an N-by-N similarity matrix, shown in the figure below. This matrix implicitly defines a high-dimensional graph: the stronger the similarity between two coins, the thicker the edge between them. To reveal the structure of this graph, we optimize for a 2-D embedding of this 4200-D graph to analyze and visualize the relationships among all the coins, and hence reveal which coins are closer neighbors of the premium coins like Bitcoin.

Heat map of the similarity matrix of 4200 coins, using inner-product of random-tree-embedding hash as the measure for similarity, sorted by row-sum.
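The row-sum ordering used for the heat map can be sketched in a few lines of NumPy. The leaf assignments below are random placeholders rather than real embedding output, and the sizes are deliberately tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_trees = 6, 20

# Hypothetical leaf index of each coin in each tree (stand-in for real embedding output).
leaves = rng.integers(0, 4, size=(n_coins, n_trees))

# similarity[i, j] = number of trees in which coins i and j share a leaf.
similarity = (leaves[:, None, :] == leaves[None, :, :]).sum(axis=2)

# Order coins from the largest total similarity to the smallest...
order = np.argsort(-similarity.sum(axis=1), kind="stable")

# ...and apply the same permutation to both rows and columns.
sorted_similarity = similarity[np.ix_(order, order)]
```

The sorted matrix can then be passed to any heat-map routine. Permuting rows and columns together keeps the matrix symmetric, so the most connected coins end up in the top-left corner.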

Visualize the Inter-Coin Relationships

2-D Embedding and Visualization

The similarity matrix, though it contains very detailed information about the relationships among all the coins, is very hard to comprehend. Therefore, we emphasize the major structure of the graph by embedding it into a 2-D space while trying to preserve the proximity among nodes as much as possible, i.e. points that are similar in the high-dimensional space are also close together in the 2-D map, and dissimilar points are far apart.

Among the many available embedding methods, we choose the t-SNE algorithm for this task, and below is the 2-D embedding that we obtain:

2-D t-SNE colored by lg of the number of Twitter followers

In the plot above, we colored the points by the number of Twitter followers. Interestingly, it turns out that everything appears to stem from BTC, which is the right-most point of all the coins. Points further to the left or lower down have correspondingly fewer Twitter followers. If our conjecture holds, namely that the more similar a coin is to Bitcoin, the better its value and status, then we should be able to find the high-ranking coins in the vicinity of Bitcoin.
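The t-SNE step can be sketched with scikit-learn's TSNE. The article does not state how the similarity matrix was fed to t-SNE, so the conversion below (distance = number of trees minus similarity, passed as a precomputed metric) is an assumption made for illustration, as are the synthetic data and the perplexity value.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_coins, n_trees = 150, 50

# Hypothetical leaf assignments standing in for the real random-tree embedding.
leaves = rng.integers(0, 8, size=(n_coins, n_trees))
similarity = (leaves[:, None, :] == leaves[None, :, :]).sum(axis=2)

# Assumed conversion: coins sharing every leaf get distance 0,
# coins sharing no leaf get distance n_trees.
distance = (n_trees - similarity).astype(float)

# metric="precomputed" treats the input as a distance matrix;
# scikit-learn requires a non-PCA initialisation in that case.
tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=30, random_state=0)
coords = tsne.fit_transform(distance)  # shape (n_coins, 2)
```

Each row of coords is the 2-D position of one coin and can be drawn as a scatter plot colored by any feature, for example the base-10 logarithm of the Twitter follower count.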

Cross-validation with Market-cap/Volume Ranks

Zooming into the vicinity of Bitcoin, as shown in the plot below, we do find most of the familiar mainstream coins that rank high on ranking websites like Coinmarketcap and Coingecko. This is a very good sign, considering that we reach similar ranking results without relying on any trading- or volume-related data. This agreement lends strong support to our data-driven approach of ranking coins using social/development data.

2-D t-SNE colored by lg of the number of Twitter followers, zoomed in near Bitcoin

Happy Discoveries of Newcomers

In addition, we can use this embedding to find some promising newcomers, since some less well-known coins also pop up in this busy field. For example, Grin (GRIN), a new coin ranked 200+ on both Coinmarketcap and Coingecko (2019/03/22), appears in the vicinity of Bitcoin. Though still not traded as heavily as its counterparts, Grin actually has a fairly large community following and exchange coverage. Our set of features picks this up more quickly than volume and market cap do, and it is readily reflected in this graph embedding.

An Eerie “Comet”

The shape of this 2-D embedding also reveals a disturbing fact: the family of cryptocurrencies looks quite similar to a comet, with most of its mass centered around its solid core (Bitcoin) and a very loose tail trailing far behind. Although the tail appears to be enormous, it contains very little mass. In the crypto realm, most of the coins in the tail, unfortunately, have a very uncertain future.

Not All Features Are Created Equal

2-D t-SNE colored by lg of the number of Telegram channel users

Previously, we saw that the number of Twitter followers colors the points well, i.e. nearby points tend to have similar colors. This indicates that the number of Twitter followers is a good feature for distinguishing premium coins from cheap ones. However, if we color the points by the number of Telegram channel users, we see no obvious pattern. Probably due to over-aggressive marketing and message bots, the number of Telegram channel users essentially loses its efficacy as a feature for selecting coins; at least, that is what this data set indicates.
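The visual judgment that "nearby points tend to have similar colors" can also be turned into a number. The sketch below is one hypothetical way to do so and is not part of the analysis above: it correlates each point's feature value with the mean value of its k nearest neighbors on the 2-D map. The function name, the choice of k, and the synthetic data are all assumptions.

```python
import numpy as np

def neighbor_consistency(coords, values, k=10):
    """Correlation between each point's value and the mean value of its k nearest neighbors."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    neighbor_mean = values[nearest].mean(axis=1)
    return np.corrcoef(values, neighbor_mean)[0, 1]

# Synthetic 2-D map with one feature that varies smoothly across it
# and one feature that is pure noise.
rng = np.random.default_rng(0)
coords = rng.normal(size=(300, 2))
smooth_feature = coords[:, 0] + 0.1 * rng.normal(size=300)
noisy_feature = rng.normal(size=300)

smooth_score = neighbor_consistency(coords, smooth_feature)
noisy_score = neighbor_consistency(coords, noisy_feature)
```

A feature that colors the map well, as the Twitter follower count appears to, should score close to 1, while a feature with no spatial pattern should score near 0.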

A Bitcoin-Centric Ranking System

The graph visualization in the previous section provides us with two intuitions:

It is feasible to rank cryptocurrencies by their distance to Bitcoin.

To evaluate the status of a cryptocurrency, we can locate it on the 2-D embedding to see whether it is “near the core” or “in the tail”.

Example 1: Cryptocurrency Ranking

Ranking by similarity to Bitcoin, both in the original 4200-D space and in the t-SNE 2-D space. The ranking results are very similar.

The table above demonstrates two ways to rank the cryptocurrencies by how similar they are to Bitcoin: one uses the similarity given by the N-by-N matrix, the other uses the distance on the 2-D t-SNE map. It turns out that the two rankings are very close to each other.
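The two rankings can be sketched as follows. The four-coin similarity matrix and 2-D coordinates are made-up toy values, and the helper names and the top-k overlap measure are assumptions introduced for illustration only.

```python
import numpy as np

def rank_by_similarity(similarity, anchor):
    """Indices of all other coins, most similar to the anchor first."""
    order = np.argsort(-similarity[anchor], kind="stable")
    return order[order != anchor]

def rank_by_distance(coords, anchor):
    """Indices of all other coins, closest to the anchor on the 2-D map first."""
    dist = np.linalg.norm(coords - coords[anchor], axis=1)
    order = np.argsort(dist, kind="stable")
    return order[order != anchor]

def top_k_overlap(rank_a, rank_b, k):
    """Fraction of coins shared by the top k of both rankings."""
    return len(set(rank_a[:k].tolist()) & set(rank_b[:k].tolist())) / k

# Toy data: index 0 plays the role of Bitcoin.
names = ["BTC", "ETH", "XRP", "JUNK"]
similarity = np.array([[10, 8, 6, 1],
                       [8, 10, 5, 2],
                       [6, 5, 10, 1],
                       [1, 2, 1, 10]])
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [9.0, 9.0]])

by_similarity = rank_by_similarity(similarity, anchor=0)
by_distance = rank_by_distance(coords, anchor=0)
overlap = top_k_overlap(by_similarity, by_distance, k=2)
```

In this toy case both rankings put ETH first, XRP second and JUNK last, so the top-2 overlap is 1.0. On real data, the overlap for a given k gives a quick check of how closely the 2-D map preserves the ordering of the full similarity matrix.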

Example 2: Status of Different Exchange Platform Tokens

Another example is to locate all the major exchange-platform tokens onto this embedding so that we can visualize the “social status” of all these coins.

It turns out that BNB is the closest to BTC, confirming its commanding status as the strongest platform coin. WAVES follows BNB quite closely. However, the other platform tokens do not appear to be as promising as these two. Apart from BNT, COB, and KCS, all the other coins are very far away from the “core” of the comet. Ironically, the once-stellar FT, the platform token of Fcoin, sits in the most distant part of the tail.

Locations of major platform coins on the 2-D t-SNE map. BNB and WAVES are close to BTC, indicating a promising future, while others are rather distant.

Conclusion

In this article, we have applied random-tree embedding and t-SNE to visualize the social/development data of more than 4000 cryptocurrencies. The visualization supports several observations: