Virtual IP Failover for PostgreSQL in AWS

We at SecureDB rely on PostgreSQL, the highly rated, enterprise-class open source database. In an earlier blog post, we explained why SecureDB chose PostgreSQL as its backend.

SecureDB provides Encrypted User Identity Management via RESTful APIs, and the cloud version of our offering is hosted on AWS. We take high availability very seriously: in addition to regular logical and physical backups (stored in S3 and also offline), we configured our PostgreSQL backend with streaming replication. Here’s how we did virtual IP failover for PostgreSQL in AWS.

To allow seamless failover when the master goes down, we used a virtual IP address failover mechanism for PostgreSQL in our AWS environment. The idea is that the failover node, or a third “witness” node, monitors the master instance to which the virtual IP address is assigned. If the master becomes unreachable, the slave node takes over the virtual IP address, but only after ensuring that it has been promoted successfully to become the new master.
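The takeover order described above (promote the standby first, move the address second) can be sketched in shell as follows. This is a minimal sketch rather than our production script: the function names `in_recovery` and `promote_standby`, the `PGDATA` default, and the `PROMOTE_TRIES` limit are assumptions made for illustration. It also assumes that `pg_ctl` and `psql` are on the PATH and that the code runs as the `postgres` OS user on the “Slave” node.

```shell
#!/bin/sh
# Assumed data directory; adjust to match your installation.
PGDATA=${PGDATA:-/var/lib/pgsql/data}

# Prints "t" while the server is still a standby, "f" once it is promoted.
in_recovery() {
    psql -At -c 'SELECT pg_is_in_recovery();'
}

# Promotes the local standby, then polls until it has left recovery.
# Returns non-zero if promotion fails or does not finish in time.
promote_standby() {
    pg_ctl promote -D "$PGDATA" || return 1
    tries=0
    while [ "$tries" -lt "${PROMOTE_TRIES:-30}" ]; do
        [ "$(in_recovery)" = "f" ] && return 0
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}
```

A caller would move the virtual IP only when `promote_standby` returns success, so that clients are never pointed at a node that is still read-only.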

Most of the inspiration came from an AWS article. The major difference is that the AWS article covers instances in a public subnet of the VPC. In our case, for security reasons and because of our architecture, the PostgreSQL instances are part of a private subnet with no direct access to the outside world, so we applied the same technique to instances in a private subnet.

Without going into too much detail about AWS VPCs and Security Groups (there’s plenty of documentation on the AWS website for this), create both the “Master” and “Slave” instances in your private subnet.

While creating “Master”, assign a private IP, say “198.0.0.22”, and a secondary private IP, “198.0.0.21”. This secondary private IP is the virtual IP that will float between the master and slave instances:

[Figure: Private IP and secondary private IP of an AWS private subnet instance]

While creating “Slave”, assign a private IP, for example “198.0.0.23”.

Take care to ensure that the virtual IP “198.0.0.21” can be assigned to either instance. We use CentOS 6.5 and had to add the proper entries in /etc/sysconfig/network-scripts on both VMs to allow assignment of this virtual IP. Here’s what a sample config looks like on both the “Master” and “Slave” instances:

cat /etc/sysconfig/network-scripts/ifcfg-eth0:0
DEVICE="eth0:0"
IPADDR=198.0.0.21
MTU="1500"
TYPE="Ethernet"
NAME="eth0:0"
NETMASK=255.255.255.0
NETWORK=198.0.0.0
ONPARENT=yes
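Because the same file has to exist on both nodes, it can help to generate it instead of typing it twice. The following is a minimal sketch under that assumption; the helper name `render_vip_ifcfg` is hypothetical, and the values are the ones used in this post.

```shell
#!/bin/sh
# Hypothetical helper: print an ifcfg alias file for the virtual IP.
# Arguments: alias device, virtual IP, netmask, network address.
render_vip_ifcfg() {
    dev=$1 vip=$2 netmask=$3 network=$4
    cat <<EOF
DEVICE="$dev"
IPADDR=$vip
MTU="1500"
TYPE="Ethernet"
NAME="$dev"
NETMASK=$netmask
NETWORK=$network
ONPARENT=yes
EOF
}

# Preview the file for the addresses used in this post.
render_vip_ifcfg eth0:0 198.0.0.21 255.255.255.0 198.0.0.0
```

On each node, the output could be redirected (as root) to /etc/sysconfig/network-scripts/ifcfg-eth0:0, after which `ifup eth0:0` brings the alias up on CentOS 6.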

Both instances are now ready to handle virtual IP re-assignments via the AWS CLI. After this, configure PostgreSQL streaming replication on the “Slave” node in the standard way.

Next, we need a monitor script. AWS provides a virtual IP failover script that can be customized to one’s requirements. We need to set up proper values for the variables in the shell script. The notable ones are:

HA_Node_IP - This should point to the primary private IP address of the other HA node, the one being monitored (198.0.0.22, the “Master”, in this example).

VIP - This should point to the private virtual IP address that will float between the two HA nodes (198.0.0.21 in this example).

REGION - This should point to the region where your HA nodes are running (us-east-1 in this example).
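As a rough illustration, the variables above and the lookup of the local network interface ID (the `ENI_ID` used by the failover command) might be set up as follows. This is a sketch, not the AWS script itself: the helper name `lookup_eni_id` is hypothetical, and it assumes that `curl` and the AWS CLI are installed, that the instance metadata service answers plain HTTP requests, that the instance has a single network interface, and that the node can reach the EC2 API from the private subnet.

```shell
#!/bin/sh
# Values from this post's example; adjust for your own environment.
HA_Node_IP=198.0.0.22   # primary private IP of the monitored "Master"
VIP=198.0.0.21          # virtual IP that floats between the nodes
REGION=us-east-1        # region in which both nodes run

# Hypothetical helper: print the ID of this instance's network interface.
# Asks the instance metadata service for the instance ID, then asks the
# EC2 API for the first network interface attached to that instance.
lookup_eni_id() {
    instance_id=$(curl --silent http://169.254.169.254/latest/meta-data/instance-id) || return 1
    aws ec2 describe-instances \
        --region "$REGION" \
        --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].NetworkInterfaces[0].NetworkInterfaceId' \
        --output text
}
```

Running `ENI_ID=$(lookup_eni_id)` on the node that runs the monitor script would then give the interface that should receive the virtual IP.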

We modified this script to tolerate temporary ping failures. Only when the ping fails 3 times in a row do we issue the failover command that moves the virtual IP to the “Slave” instance. The script also emails us after each ping failure to let us know about the issue at hand. Here’s a sample snippet from our script:

if [ "$FAIL_CNT" -lt 3 ]
then
    # Fewer than 3 consecutive failures so far: send an alert, wait, and recheck.
    echo "HA script. Master unreachable. Rechecking after $SLEEP_SECS seconds. Try count $FAIL_CNT. Limit 3" | mailx -r [email protected] -s "HA Script" -v [email protected]
    sleep "$SLEEP_SECS"
    continue
fi
# Third consecutive failure: move the virtual IP to this node's network interface.
echo "$(date) -- HA heartbeat failed, taking over VIP"
aws ec2 assign-private-ip-addresses --network-interface-id "$ENI_ID" --private-ip-addresses "$VIP" --allow-reassignment
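For context, the excerpt above lives inside a polling loop. The following is a minimal sketch of what such a loop might look like; it is neither the AWS script nor our exact version. The function names `master_alive`, `notify`, `take_over_vip`, and `monitor`, the `MAX_FAILS` variable, and the default values are assumptions, and `notify` simply prints a message where a real script would send mail.

```shell
#!/bin/sh
# Assumed settings; ENI_ID must hold this node's network interface ID.
HA_Node_IP=${HA_Node_IP:-198.0.0.22}
VIP=${VIP:-198.0.0.21}
REGION=${REGION:-us-east-1}
SLEEP_SECS=${SLEEP_SECS:-5}
MAX_FAILS=3

# Succeeds when the master answers a short burst of pings.
master_alive() {
    ping -c 3 -W 1 "$HA_Node_IP" >/dev/null 2>&1
}

# Hypothetical hook: replace the body with your own alerting (e.g. mailx).
notify() {
    echo "$(date) -- $*"
}

# Moves the virtual IP to this node's network interface.
take_over_vip() {
    aws ec2 assign-private-ip-addresses \
        --region "$REGION" \
        --network-interface-id "$ENI_ID" \
        --private-ip-addresses "$VIP" \
        --allow-reassignment
}

# Polls the master; fails over only after MAX_FAILS consecutive misses.
# A successful ping resets the counter, so isolated failures are tolerated.
monitor() {
    fail_cnt=0
    while true; do
        if master_alive; then
            fail_cnt=0
        else
            fail_cnt=$((fail_cnt + 1))
            if [ "$fail_cnt" -ge "$MAX_FAILS" ]; then
                notify "HA heartbeat failed, taking over VIP"
                take_over_vip
                return
            fi
            notify "Master unreachable, try $fail_cnt of $MAX_FAILS"
        fi
        sleep "$SLEEP_SECS"
    done
}
```

To run it, set `ENI_ID` to the network interface ID of the node that should receive the virtual IP and call `monitor` at the end of the script. Resetting the counter on every successful ping is what makes the failures "consecutive" rather than cumulative.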

That’s it! Ensure that your virtual IP is assigned to the “Master” node and reachable, then run the above script continuously on the “Slave” node. You could also run it from a separate “Witness” node if desired. Because the script uses the AWS CLI to re-assign the IP, AWS updates its internal routing metadata, so queries are routed to the newly promoted replica on the “Slave” node.

We cannot overemphasize the importance of testing all scenarios before deploying this in your production environment! We have had to resort to this virtual IP based failover once in the last year. Obviously, you cannot use it across multiple regions, but it works well enough if you want HA within your region/zone. Hope it helps you as well!
