So, my openhab system periodically decides to leave the building. It appears that from time to time, when the z-wave binding loses communication with the z-wave stick, it gets upset and tells openhab to take a hike.

This is bad. For one, because it exposed something I missed in my fault tolerance. I had compensated for network issues and full machine failover. But the actual process going belly up... oops. My bad.

Soooo I see it crash while at the gym today and the only thing in my head….

So I appear to have done that.

Let me bring you up to speed on the current state of my home automation. After the great NAS failing of 2015 I was forced to reduce some of my virtual environment. I have not brought my secondary HA controller back online yet. However, it appears that by still using keepalived I am able to help address this random problem.

I have added a new option to my keepalived.conf:

vrrp_script chk_hahealth {
    script "/usr/local/sbin/healthcheck.sh"
    interval 10   # check every 10 seconds
    fall 2        # require 2 failures for KO
    rise 2        # require 2 successes for OK
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 220
    priority 150
    notify /usr/local/sbin/notify-keepalived.sh
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass fakepass
    }
    virtual_ipaddress {
        192.168.2.90
    }
    track_script {
        chk_hahealth
    }
}

So what this does is add a keepalived health check. Every 10 seconds keepalived runs the script /usr/local/sbin/healthcheck.sh and gets an exit code of 0 or 1. 0 if all is good. 1 if the world fell apart.
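If you want to see what keepalived will see, you can run a check by hand and look at its exit status. Here is a minimal sketch, assuming a hypothetical helper called `interpret_exit` (my own name, not part of keepalived) that runs any command and prints how a check with that exit status would be counted:

```shell
#!/bin/sh
# interpret_exit is a hypothetical helper, not part of keepalived.
# It runs the given command and reports how its exit status would be
# counted by a health check: 0 is success, anything else is a failure.
interpret_exit() {
    if "$@" > /dev/null 2>&1; then
        echo "OK"
    else
        echo "KO"
    fi
}

interpret_exit true    # a command that exits 0
interpret_exit false   # a command that exits 1
```

Swapping `true` for the path to the health check script shows what keepalived would get back from it. Keep in mind that the script in this post also restarts openhab when the check fails, so running it by hand is not side-effect free.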

The code for this script is:

#!/bin/sh
SERVICE=openhab

if ps ax | grep -v grep | grep $SERVICE > /dev/null
then
    echo "$SERVICE service running, everything is fine"
    /usr/bin/logger "$SERVICE service running, everything is fine"
    exit 0
else
    echo "$SERVICE is not running"
    /usr/bin/logger "$SERVICE is not running"
    /etc/init.d/openhab restart
    exit 1
fi

Explanation:

So this script just checks to see if the openhab process is running. If it's good, exit 0. If it's not, exit 1, but go ahead and try to restart openhab. When keepalived gets the exit 1 code it keeps track of it. You will see in the config that there is a fall 2 line. That means that after 2 consecutive exit 1 statuses keepalived will go into a failed state. When the second HA box is back online this will force openhab to move over to the other one. However, I have not seen this happen so far: openhab loads pretty quickly, and since there are 10 seconds between checks, the second check comes back with an exit 0 and resets the fall count.
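To make the fall/rise counting concrete, here is a rough sketch that replays a list of exit codes and prints the state after each check. This is only an illustration of the counting described above, not keepalived's actual implementation; `track_state`, `FALL` and `RISE` are names I made up for the example.

```shell
#!/bin/sh
# Illustration only: mimics the fall/rise counting described above.
# This is not keepalived's implementation.
FALL=2   # consecutive failures needed to go KO
RISE=2   # consecutive successes needed to go back to OK

# track_state takes a sequence of exit codes (one per check interval)
# and prints the resulting state after each check.
track_state() {
    state=OK
    fails=0
    oks=0
    for code in "$@"; do
        if [ "$code" -eq 0 ]; then
            oks=$((oks + 1))
            fails=0
            if [ "$state" = KO ] && [ "$oks" -ge "$RISE" ]; then
                state=OK
            fi
        else
            fails=$((fails + 1))
            oks=0
            if [ "$state" = OK ] && [ "$fails" -ge "$FALL" ]; then
                state=KO
            fi
        fi
        echo "exit=$code state=$state"
    done
}

# One failed check followed by successful ones: the state never leaves OK.
track_state 0 1 0 0
```

That last call is the case I keep hitting: one failed check, openhab restarts, and the next check passes before the fall count reaches 2. Calling `track_state 1 1` instead would show the state flipping to KO on the second failure.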