Just to add to the conversation:

I’ve only recently found out about IPFS, and it seems like it could be a real boon for reproducibility in science.

In my particular research community, large (up to around 10 TB) binary files are generated through very time-consuming simulations. Storing them appropriately is a big deal: losing files means repeating simulations that can span several months. Sharing them with colleagues is of course also really important, but it is not always doable in practice, unfortunately. For example, I effectively can’t download simulation datasets of several terabytes hosted in Stanford’s repository, because I am based in Europe and the transfer would take an absurdly long time.

From what I’ve gathered in my short time reading about IPFS, the whole point is to speed up file sharing by fetching content from the nearest peers that already have it, rather than from a central repository. But I’ve also read that duplication is avoided, and that each node in the network only stores content it is ‘interested’ in. So, in the case I mentioned above, how would IPFS decide who stores these large datasets? Wouldn’t it be too costly to have them duplicated? If nobody else stores them, we would be back in the situation I am in now: downloading a huge dataset from across the globe is infeasible.
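To make my (possibly naive) mental model concrete, here is roughly the publish/fetch workflow I imagine, sketched in Python around the standard Kubo (go-ipfs) CLI commands `ipfs add`, `ipfs pin add`, and `ipfs get`. The directory name and output path are just placeholders, and the daemon is assumed to be running, so please correct me if I’ve misunderstood how pinning works:

```python
# Minimal sketch of how I imagine publishing and fetching a dataset over IPFS,
# assuming the Kubo (go-ipfs) CLI is installed and `ipfs daemon` is running.
# Paths and names below are made-up placeholders.
import subprocess

def publish_dataset(path: str) -> str:
    """Add a simulation output directory to IPFS and return its root CID."""
    # `ipfs add -r -Q` recursively adds the directory and prints only the root CID.
    result = subprocess.run(
        ["ipfs", "add", "-r", "-Q", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def pin_dataset(cid: str) -> None:
    """Pin a CID so this node keeps (and keeps serving) the data."""
    subprocess.run(["ipfs", "pin", "add", cid], check=True)

def fetch_dataset(cid: str, out_dir: str) -> None:
    """Fetch a dataset by CID from whichever peers currently hold it."""
    subprocess.run(["ipfs", "get", cid, "-o", out_dir], check=True)

if __name__ == "__main__":
    cid = publish_dataset("./simulation_run_042")  # hypothetical directory
    pin_dataset(cid)
    print(f"Dataset published and pinned as {cid}")
```

In this picture, my question is essentially about who, other than the original author, would end up pinning a multi-terabyte dataset, and whether a colleague on another continent would still have to pull all of it from that single node.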

I’d be interested in hearing comments on this from more knowledgeable members of the IPFS community.