Many BP nodes went down on July 8th because the default setting of “chain-state-db-size-mb” is 1GB, and that was the day the EOS database reached this limit. This parameter specifies the amount of virtual, disk-backed shared memory that the nodeos daemon uses at runtime; if the database exceeds this limit, the node crashes.

At startup, nodeos creates a sparse file that backs its virtual memory, with the size specified in “chain-state-db-size-mb”. The “du” and “df” utilities show the actual usage on the file system, so the sparse file appears to occupy very little space.
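To see why the reported usage is so small, here is a minimal sketch that creates a sparse file and compares its apparent size with the space actually allocated. It is only an illustration: the file name and the 64 MB size are made up, and it assumes a POSIX system on a file system that supports holes.

```python
import os
import tempfile


def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_blocks counts 512-byte units actually allocated (POSIX only).
    return st.st_size, st.st_blocks * 512


def demo(size_mb=64):
    """Create a sparse file of size_mb megabytes and report its sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, not the one nodeos uses.
        path = os.path.join(tmp, "sparse_demo.bin")
        with open(path, "wb") as f:
            # Extending the file with truncate() writes no data blocks,
            # so the file system records a hole instead.
            f.truncate(size_mb * 1024 * 1024)
        return sizes(path)
```

Calling demo() returns the apparent size (what “ls -l” reports) and the allocated size (what “du” reports); on a file system with sparse-file support the second number is far smaller than the first.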

Once you realize that the available database memory is running low, you can stop the daemon, increase the DB size setting, and start it again. As long as the daemon stops and starts gracefully, this operation is safe.

Not all utilities understand sparse files. For example, scp produces a dense file on the target system, so the copy occupies the full configured size on disk.
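For illustration, a copy can keep the holes by skipping blocks that contain only zero bytes. The sketch below is a hypothetical helper, not part of nodeos or of any tool mentioned here, and it assumes the target file system supports sparse files.

```python
import os
import tempfile


def copy_sparse(src, dst, block_size=1024 * 1024):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            if block.count(0) == len(block):
                # Nothing but zero bytes: move forward and leave a hole.
                fout.seek(len(block), os.SEEK_CUR)
            else:
                fout.write(block)
        # A trailing hole only exists once the final length is set.
        fout.truncate(fin.tell())


def demo():
    """Copy an 8 MiB sparse file that holds a few bytes of real data."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.truncate(8 * 1024 * 1024)
            f.seek(4 * 1024 * 1024)
            f.write(b"state")
        copy_sparse(src, dst)
        with open(src, "rb") as a, open(dst, "rb") as b:
            identical = a.read() == b.read()
        st = os.stat(dst)
        return identical, st.st_size, st.st_blocks * 512
```

The copy has the same length and content as the source, but only the block that holds real data is allocated on disk.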

If nodeos crashes, it leaves the memory database in a dirty state, and a full replay of the blockchain is required. It may take half a day or longer for the server to rebuild its state database.

The best approach for running a production EOS server is to keep a second server in the background, stop it periodically, and make a backup of its database. Should the production node crash, you can quickly restore its database from the latest backup, and it will quickly re-sync with the network, since the backlog will amount to only a few hours.

There is also an RPC plugin that reports the actual memory usage:

plugin = eosio::db_size_api_plugin

With this plugin enabled, the node reports the total database memory, the amount currently in use, and the amount free.
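As a rough sketch of reading these numbers from a script: the code below assumes that nodeos serves HTTP on 127.0.0.1:8888, that the plugin answers a plain GET at /v1/db_size/get, and that the JSON reply has “size”, “used_bytes” and “free_bytes” fields. None of these details are stated above, so check them against your own node before relying on them.

```python
import json
import urllib.request

# Assumed endpoint; adjust host, port and path to match your node.
DB_SIZE_URL = "http://127.0.0.1:8888/v1/db_size/get"


def fetch_db_size(url=DB_SIZE_URL, timeout=5):
    """Fetch the raw statistics as a dict (performs a network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(stats):
    """Convert byte counts into MiB and a usage percentage."""
    mib = 1024 * 1024
    # int() accepts both JSON numbers and numeric strings.
    size = int(stats["size"])
    used = int(stats["used_bytes"])
    free = int(stats["free_bytes"])
    return {
        "total_mib": size / mib,
        "used_mib": used / mib,
        "free_mib": free / mib,
        "used_pct": 100 * used / size,
    }
```

Calling summarize(fetch_db_size()) on a running node would then give the total, used and free amounts in MiB, along with the percentage in use.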

Production nodes need to be monitored periodically for memory usage, and the administrators should be alerted if memory consumption rises above 80%.
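One way to automate such a check is sketched below. Both the fetch function and the alert callback are placeholders you would supply yourself (for example an HTTP call to the node, and an e-mail or pager hook). The “used_bytes” and “size” field names and the five-minute interval are assumptions, not taken from the plugin documentation.

```python
import time

ALERT_THRESHOLD = 0.80  # alert above 80% usage, as recommended above


def check_once(fetch_stats, alert, threshold=ALERT_THRESHOLD):
    """Fetch the statistics once and alert if usage exceeds the threshold."""
    stats = fetch_stats()
    # Assumed field names; int() accepts numbers or numeric strings.
    ratio = int(stats["used_bytes"]) / int(stats["size"])
    if ratio > threshold:
        alert(f"nodeos state DB at {ratio:.0%} of configured size")
    return ratio


def monitor(fetch_stats, alert, interval_s=300, rounds=None):
    """Run check_once every interval_s seconds; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        check_once(fetch_stats, alert)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

Passing the fetcher and the alert hook as arguments keeps the threshold logic independent of how the node is queried, so it can be exercised with fake data before being pointed at a production node.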