CouchDB supports replication, but replication has to be set up again after each server restart. This means you need to make sure replication is restarted whenever the daemon restarts CouchDB. I have never seen replication stop working without a restart, but I would rather be safe than sorry. To be perfectly honest, I do not fully trust that my replication setup after a soft CouchDB restart works properly either, so I prefer to monitor replication and have a safety mechanism in place that restarts it if needed.

There are several ways to monitor replication. You could fetch the status page of every server and restart replication on the servers where the page is empty, but that is a rather brute-force approach in my book. A better solution is to use replication itself to check that it works.

Each server writes its own timestamp to CouchDB, and replication carries it to the other servers. This gets us part of the way, but not all the way. The server you are checking may have received updates from all the other servers, but you don’t know whether it has pushed anything out to them. To solve this, each server also records the timestamps it has received from the other servers in its own document. Together, these documents give you a matrix of server replication status.

For each server, you will see the timestamp replicated from that server and a list of timestamps replicated to that server, the latter often being a generation older than the former. Cron can be used to update this data: the cron job reads all the server timestamps and updates this server's timestamp, followed by the list of the other servers' timestamps.
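To make the shape of the data concrete, here is a minimal sketch of what one cron run could do to this server's status document. The field names `time` and `servers` match the checker script further down; the host names and the helper name `refresh_status` are made up for illustration, and the documents are plain hashes rather than anything fetched from CouchDB.

```ruby
# Hypothetical helper: builds this server's new status document from the
# status documents currently visible in the local database.
def refresh_status(hostname, docs, now)
  own = { '_id' => hostname, 'time' => now, 'servers' => {} }
  docs.each do |doc|
    next if doc['_id'] == hostname
    # Record the last timestamp that was replicated to us from that server.
    own['servers'][doc['_id']] = doc['time']
  end
  own
end

# Made-up example: "alpha" is the local server, "beta" is a peer.
docs = [
  { '_id' => 'alpha', 'time' => 1000, 'servers' => { 'beta' => 940 } },
  { '_id' => 'beta',  'time' => 1060, 'servers' => { 'alpha' => 1000 } }
]
status = refresh_status('alpha', docs, 1120)
p status
```

Note how beta's copy of alpha's timestamp (1000) lags alpha's own new timestamp (1120): that is the "generation older" effect described above.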

A mapper that gets the server id to server status mapping out of the db:



map: function(doc) { emit(doc._id, doc); }

Our monitoring database is called server_status. The design document containing the mapper is called collections, and the view is called server_list.
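As a sketch of how these names fit together, the design document could be written as the following Ruby hash. The names come from the text above; the layout is the usual CouchDB design document structure, and saving it to the database is left out here.

```ruby
require 'json'

# Design document "collections" holding the view "server_list".
design_doc = {
  '_id' => '_design/collections',
  'views' => {
    'server_list' => {
      'map' => 'function(doc) { emit(doc._id, doc); }'
    }
  }
}

# Path of the view relative to the CouchDB root.
view_path = "/server_status/#{design_doc['_id']}/_view/server_list"

puts JSON.pretty_generate(design_doc)
puts view_path
```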

A Ruby database checker that can run on cron.



    require 'rubygems'
    require 'couchrest'
    require 'json'
    require 'open-uri'

    STATUS_DB   = 'http://localhost:5984/server_status'
    COLLECTIONS = 'collections'
    SERVER_LIST = 'server_list'

    hostname = ARGV[0]
    status_db = CouchRest.database!(STATUS_DB)
    status_view = "#{STATUS_DB}/_design/#{COLLECTIONS}/_view/#{SERVER_LIST}"

    # Get the current information about this server if available
    server_status = begin
      status_db.get(hostname)
    rescue RestClient::ResourceNotFound
      {'_id' => hostname}
    end
    server_status['time'] = Time.now.to_i
    # A freshly created document has no 'servers' hash yet
    server_status['servers'] ||= {}

    # Get the current times of the other servers and update this server's
    # view of them (URI.open needs Ruby 2.5+; older Rubies use plain open)
    JSON(URI.open(status_view).read)['rows'].map do |row|
      {'server' => row['id'], 'status' => row['value']}
    end.each do |status|
      unless status['server'] == hostname
        server_status['servers'][status['server']] = status['status']['time']
      end
    end

    status_db.save_doc(server_status)

Now you need to decide when to trigger a replication restart. This can be handled in the watchdog cron job: if the newest timestamp seen for this server at the other servers is older than a threshold, restart replication.
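Taken literally, that rule could be sketched as below. This is only an illustration of the rule as worded, separate from the watchdog loop that follows; the helper name `replication_stale?`, the host names, and the choice to treat "no peer has ever seen us" as stale are all assumptions.

```ruby
# Hypothetical helper: true when even the freshest copy of this server's
# timestamp held by the other servers is older than max_age seconds.
def replication_stale?(hostname, docs, now, max_age)
  seen = docs.reject { |doc| doc['_id'] == hostname }
             .map { |doc| (doc['servers'] || {})[hostname] }
             .compact
  # Assumption: if no peer has recorded us at all, treat that as stale.
  return true if seen.empty?
  now - seen.max > max_age
end

# Made-up example: beta saw alpha at 1000, gamma saw alpha at 900.
peers = [
  { '_id' => 'alpha', 'time' => 1100, 'servers' => {} },
  { '_id' => 'beta',  'time' => 1090, 'servers' => { 'alpha' => 1000 } },
  { '_id' => 'gamma', 'time' => 1080, 'servers' => { 'alpha' => 900 } }
]
fresh = replication_stale?('alpha', peers, 1100, 120)
stale = replication_stale?('alpha', peers, 1200, 120)
puts "at 1100: #{fresh}, at 1200: #{stale}"
```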

The final loop triggers a restart when the age is above a threshold. The init_replication method just posts a continuous replication trigger to the db:



    # THRESHOLD is the maximum tolerated age in seconds
    server_status['servers'] ||= {}
    JSON(URI.open(status_view).read)['rows'].map do |row|
      {'server' => row['id'], 'status' => row['value']}
    end.each do |status|
      # Skip our own document; there is nothing to replicate to ourselves
      next if status['server'] == hostname
      if server_status['time'] - status['status']['time'] > THRESHOLD
        init_replication(status['server'])
      end
      server_status['servers'][status['server']] = status['status']['time']
    end

A rudimentary init_replication method:



    require 'net/http'
    require 'json'

    def init_replication(server)
      target = "http://#{server}:5984"
      databases = ['server_status']
      databases.each do |db|
        config = {
          'source' => db,
          'target' => "#{target}/#{db}",
          'continuous' => true
        }
        payload = JSON.generate(config)
        result = Net::HTTP.new('127.0.0.1', 5984).post(
          '/_replicate', payload, {'Content-Type' => 'application/json'})
        # result.code is a String, so test the response class instead;
        # any 2xx answer counts as success
        unless result.is_a?(Net::HTTPSuccess)
          warn "replication to #{target}/#{db} failed with #{result.code}"
        end
      end
    end

We have a monitoring view of replication ages in our system. It shows the matrix of timestamps as ages in seconds rather than as the actual timestamps, since the age is the important metric.
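The conversion from timestamps to ages can be sketched roughly like this. The helper name `age_matrix` and the sample documents are invented for illustration; our actual monitoring view is not shown here.

```ruby
# Hypothetical helper: turns status documents into a matrix of ages in
# seconds. matrix[x][y] is the age of y's timestamp as recorded by x.
def age_matrix(docs, now)
  docs.each_with_object({}) do |doc, matrix|
    row = { doc['_id'] => now - doc['time'] }
    (doc['servers'] || {}).each { |server, time| row[server] = now - time }
    matrix[doc['_id']] = row
  end
end

# Made-up example with two servers.
sample = [
  { '_id' => 'alpha', 'time' => 1120, 'servers' => { 'beta' => 1060 } },
  { '_id' => 'beta',  'time' => 1060, 'servers' => { 'alpha' => 1000 } }
]
matrix = age_matrix(sample, 1130)

matrix.each do |server, row|
  puts "#{server}: " + row.map { |peer, age| "#{peer}=#{age}s" }.join(' ')
end
```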



A bonus of this replication monitoring system is that we can access the status page from a mobile phone and get an accurate picture of the replication status. This doesn’t worry me now, but it did when we first set it up. Now it’s just a part of our general monitoring view.