Postgres automatic and (mostly) transparent failover with pgpool
I have been working a lot with pgpool recently and I wanted to document my experiments in case it is helpful for others.
In this post I will document step by step how to install postgres/pgpool and how to configure pgpool to automate the failover. I will then install a graphical interface on top of pgpool (that I developed).
Intro
This post is intended as a learning material, it lists step by step how to install postgres/pgpool/repmgr. Since everything can be done at the command line, I will later expand on this set-up to do automated testing of pgpool watchdog. Please note that in my current company I deployed postgres and pgpool as docker images, either in a docker swarm or directly on the servers (i.e. using docker run to start the containers). This is all in the github repo https://github.com/saule1508/pgcluster. If you start a learning journey with pgpool and postgres, using docker can be of good help because it makes experimenting and testing very easy (you can quickly start from scratch).
I will put more emphasis on pgpool concepts or configurations that I found difficult to grasp (sometimes because they are poorly documented). For postgres and repmgr I will essentially list all the steps required to set them up, but it is obviously very much recommended to read the respective docs.
The end result is a set-up composed of 3 Centos 7.6 servers: one primary database and two stand-by databases, with one pgpool on each node, made highly available with the watchdog mode of pgpool. I use repmgr (https://repmgr.org/), not to automate the failover but because it brings well tested scripts on top of postgres replication.
NB: I am using libvirt to provision the VM (I provision one server then clone it), but you can use vagrant, virtual box, VMWare or provision the centos guests manually. If you have a linux host available then I suggest using libvirt; I documented here a way to provision a centos guest from the command line in a few minutes: Centos guest VM with cloud-init, this is just great.
The end-result will be similar to this:
- diagram
- graphical interface
pgpool: some facts
Before building the set-up, here are some considerations on Pgpool. Pgpool is hard to learn because it can do a lot of things and the documentation is overwhelming and not always clear. There are 3 main functions that are useful; the others are more esoteric or kept for historical reasons (like query result caching, master-master replication,…). Pgpool has a good community and once you know it you like it.
Connection caching
If you think about traditional connection pooling in the java world, where the web container manages the connection pool and java threads borrow a connection from the pool for the duration of a transaction, then you will be confused by Pgpool. Pgpool does connection caching. When an application connects to Pgpool, Pgpool uses one of its child processes, creates a connection to postgres and associates this postgres connection with the child process. When the application disconnects from pgpool, the postgres connection is cleaned up (to release locks, for example) but kept open, so that it remains in the cache for the next time.
If you have modern applications using connection pooling and if you have a lot of different databases or user names, then the connection caching of pgpool might not be interesting and it can be disabled (parameter connection_cache='off').
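For reference, these are the pgpool.conf parameters that control this behaviour (the values are the ones used later in this post, shown here only as an illustration):
connection_cache = on      # keep the postgres connection open after the client disconnects
num_init_children = 100    # number of pgpool child processes (max concurrent clients)
max_pool = 5               # cached postgres connections per child (per user/database pair)
# set connection_cache = off to disable the caching if pooling is done at the application level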
Automated and (mostly) transparent failover
Postgres has a strong replication mechanism (hot standby) in which a primary database streams its transaction logs (wal in postgres terminology, write-ahead logs) and one or more standby databases receive those write-ahead logs and apply them continuously, so that all changes done on the primary are replicated on the standbys. The primary database is open read-write, the stand-by databases are open read-only. But postgres does not offer any automated way to promote a stand-by database in case of failure of the primary. Furthermore, if a stand-by database is manually promoted and becomes the new read-write database, all applications connected to the database must be made aware that they must now connect to a new instance.
Pgpool solves those problems. It sits on top of the postgres databases, acting as a proxy, and uses the following mechanisms to detect a failure and trigger a failover:
- health checks: pgpool connects (with user health_check_user) regularly (param health_check_period) to the databases and so is aware when a primary (or a standby) fails. After a retry period (health_check_retry_delay and health_check_max_retries) it will trigger a failover, which concretely means calling a script with a number of arguments.
- Alternatively it can use the parameter failover_on_backend_error. This parameter should be off if you use health checks. What failover_on_backend_error means is that when Pgpool receives an error on one of the connections, it will immediately trigger a failover, so there is no retry. Furthermore, if there is no activity then the failure will not be detected and the failover will not occur.
Transparent failover is achieved because clients connect to pgpool and not directly to postgres: if a failover happens, pgpool knows that the primary database has changed and it will destroy then re-create all connections to postgres, without any configuration needed on the client side. Of course it is not completely transparent: any statement (query or update) that was being executed will return an error.
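As an illustration (the VIP 192.168.122.99 and port 9999 are the values configured later in this post, and critlib/cl_user are the database and user created further down): the application simply targets pgpool instead of postgres, and the same connection string keeps working after a failover.
# direct connection to postgres (what we want to avoid)
# psql -h pg01.localnet -p 5432 -U cl_user critlib
# connection through pgpool (later through the VIP)
psql -h 192.168.122.99 -p 9999 -U cl_user critlib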
But if pgpool is a proxy between the applications and postgres, a failure of pgpool itself would result in a loss of availability. In other words pgpool itself has become the single point of failure; fortunately it has a very nice mechanism that makes it HA: the watchdog mode.
pgpool watchdog mode
- watchdog adds high availability by coordinating multiple pgpool nodes (one leader elected via consensus). The leader acquires the VIP (acquiring the VIP is done via a script using ip and arping)
- pgpool nodes communicate via tcp port 9000 (ADD_NODE command, etc.)
- watchdog uses a heartbeat mechanism (optionally it can use a query mechanism instead): a “lifecheck process” sends and receives data over UDP (port 9694) to check the availability of other nodes
pgpool management protocol
pgpool exposes management functions through pcp (pgpool control protocol) on tcp port 9898. This requires a user (I choose postgres) and a password (the combination must be stored in the file /home/postgres/.pcppass), so that the linux user postgres can use pcp_attach_node, pcp_recovery_node, etc.
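For example, once the .pcppass file is in place (set up later in this post), the linux user postgres can run pcp commands without a password prompt; the node id given to pcp_attach_node below is just an example.
# -w tells pcp to read the password from ~/.pcppass instead of prompting
pcp_watchdog_info -h localhost -p 9898 -w -v
pcp_attach_node -h localhost -p 9898 -w 1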
1. Create first server with centos 7, postgres, repmgr and pgpool
The installation of the centos 7 guest is described in the link http://saule1508.github.io/libvirt-centos-cloud-image.
It needs a file system /u01 for the database (the DB will be in /u01/pg11/data) and a file system /u02 for the archived write ahead logs (wal) and the backups (/u02/archive and /u02/backup). I create a user postgres with uid 50010 (an arbitrary id, it does not matter but it is better to have the same id on all servers). The user must be created before the rpm is installed.
1.1 Postgres
# as user root
MAJORVER=11
MINORVER=2
yum update -y
# install some useful packages
yum install -y epel-release libxslt sudo openssh-server openssh-clients jq passwd rsync iproute python-setuptools hostname inotify-tools yum-utils which sudo vi firewalld
# create user postgres before installing postgres rpm, because I want to fix the uid
useradd -u 50010 postgres
# set a password
passwd postgres
# So that postgres can become root without password, I add it in sudoers
echo "postgres ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/postgres
# install postgres release rpm, this will add a file /etc/yum.repos.d/pgdg-11-centos.repo
yum install -y https://download.postgresql.org/pub/repos/yum/${MAJORVER}/redhat/rhel-7-x86_64/pgdg-centos${MAJORVER}-${MAJORVER}-${MINORVER}.noarch.rpm
# with the repo added (repo is called pgdg11), we can install postgres
# NB: postgres is also available from Centos base repos, but we want the latest version (11)
yum install -y postgresql${MAJORVER}-${MAJORVER}.${MINORVER} postgresql${MAJORVER}-server-${MAJORVER}.${MINORVER} postgresql${MAJORVER}-contrib-${MAJORVER}.${MINORVER}
# verify
yum list installed postgresql*
I want to store the postgres database in /u01/pg11/data (non-default location), I want to have the archived wal in /u02/archive and the backup in /u02/backup
mkdir -p /u01/pg${MAJORVER}/data /u02/archive /u02/backup
chown postgres:postgres /u01/pg${MAJORVER}/data /u02/archive /u02/backup
chmod 700 /u01/pg${MAJORVER}/data /u02/archive
The environment variable PGDATA is important: it points to the database location. We need to change it from the rpm default (which is /var/lib/pgsql/11/data) to the new location /u01/pg11/data.
export PGDATA=/u01/pg${MAJORVER}/data
# add the binaries in the path of all users
echo "export PATH=\$PATH:/usr/pgsql-${MAJORVER}/bin" > /etc/profile.d/postgres.sh
# source /etc/profile in bashrc of user postgres, make sure that PGDATA is defined and also PGVER so that we
# can use PGVER in later scripts
echo "[ -f /etc/profile ] && source /etc/profile" >> /home/postgres/.bashrc
echo "export PGDATA=/u01/pg${MAJORVER}/data" >> /home/postgres/.bashrc
echo "export PGVER=${MAJORVER}" >> /home/postgres/.bashrc
We need a systemd unit file for postgres, so that it is started automatically when we boot the server (see https://www.postgresql.org/docs/11/server-start.html)
cat <<EOF > /etc/systemd/system/postgresql.service
[Unit]
Description=PostgreSQL database server
Documentation=man:postgres(1)
[Service]
Type=notify
User=postgres
ExecStart=/usr/pgsql-11/bin/postgres -D /u01/pg11/data
ExecReload=/bin/kill -HUP \$MAINPID
KillMode=mixed
KillSignal=SIGINT
TimeoutSec=0
[Install]
WantedBy=multi-user.target
EOF
enable the unit but don’t start it yet
sudo systemctl enable postgresql
Now we can init the database, as user postgres
su - postgres
# check that PGDATA and the PATH are correct
echo $PGDATA
echo $PATH
pg_ctl -D ${PGDATA} initdb -o "--auth=trust --encoding=UTF8 --locale='en_US.UTF8'"
postgres configuration parameters are in $PGDATA/postgresql.conf. I prefer not to change this file but to include an additional file from the conf.d directory and override some of the default parameters there. I add the line "include_dir='conf.d'" at the end of postgresql.conf and then I add custom configurations in conf.d
# as user postgres
mkdir $PGDATA/conf.d
echo "include_dir = 'conf.d'" >> $PGDATA/postgresql.conf
# now let's add some config in this conf.d directory
cat <<EOF > $PGDATA/conf.d/custom.conf
log_destination = 'syslog,csvlog'
logging_collector = on
# better to put the logs outside PGDATA so they are not included in the base_backup
log_directory = '/var/log/postgres'
log_filename = 'postgresql-%Y-%m-%d.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
# These are relevant when logging to syslog (if wanted, change log_destination to 'csvlog,syslog')
log_min_duration_statement=-1
log_duration = on
log_line_prefix='%m %c %u - '
log_statement = 'all'
log_connections = on
log_disconnections = on
log_checkpoints = on
log_timezone = 'Europe/Brussels'
# up to 30% of RAM. Too high is not good.
shared_buffers = 512MB
#checkpoint at least every 15min
checkpoint_timeout = 15min
#if possible, be more restrictive
listen_addresses='*'
#for standby
max_replication_slots = 5
archive_mode = on
archive_command = '/opt/postgres/scripts/archive.sh %p %f'
# archive_command = '/bin/true'
wal_level = replica
max_wal_senders = 5
hot_standby = on
hot_standby_feedback = on
# for pg_rewind
wal_log_hints=true
EOF
I prefer to put the logs outside PGDATA, otherwise a backup might become very big just because of the logs. So we must create the directory
sudo mkdir /var/log/postgres
sudo chown postgres:postgres /var/log/postgres
As you can see I use the script /opt/postgres/scripts/archive.sh as archive command, this needs to be created
sudo mkdir -p /opt/postgres/scripts
sudo chown -R postgres:postgres /opt/postgres
content of script /opt/postgres/scripts/archive.sh
#!/bin/bash
# called by postgres as archive_command with %p (path of the wal file) and %f (file name)
LOGFILE=/var/log/postgres/archive.log
if [ ! -f $LOGFILE ] ; then
touch $LOGFILE
fi
echo "$(date) archiving $1 to /u02/archive/$2" >> $LOGFILE
cp $1 /u02/archive/$2
exit $?
make sure it is executable
chmod +x /opt/postgres/scripts/archive.sh
⚠️: When archive mode is on, you must clean the directory where the archived wal's are copied, usually via a backup script that takes a backup and then removes the archived wal older than the backup
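A minimal sketch of such a cleanup, assuming backups are taken locally with pg_basebackup into /u02/backup and that a 7 day retention is acceptable; the script name and the retention are my own choice, adapt them to your backup policy.
#!/bin/bash
# /opt/postgres/scripts/backup.sh - example only
BACKUP_DIR=/u02/backup/$(date +%Y%m%d)
mkdir -p $BACKUP_DIR
# full backup of the local instance (tar format, compressed)
pg_basebackup -D $BACKUP_DIR -Ft -z -X fetch
# remove archived wal and old backups after 7 days
find /u02/archive -type f -mtime +7 -delete
find /u02/backup -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;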
Now let’s create a database and a user
sudo systemctl start postgresql
# verify
sudo systemctl status postgresql
psql -U postgres postgres
at the psql prompt, create a database (my database is called critlib), then quit
create database critlib encoding 'UTF8' LC_COLLATE='en_US.UTF8';
\q
now reconnect to the newly created database and create two users (cl_owner and cl_user)
psql -U postgres critlib
create user cl_owner nosuperuser nocreatedb login password 'cl_owner';
create schema cl_owner authorization cl_owner;
create user cl_user nosuperuser nocreatedb login password 'cl_user';
grant usage on schema cl_owner to cl_user;
alter default privileges in schema cl_owner grant select,insert,update,delete on tables to cl_user;
alter role cl_user set search_path to "$user",cl_owner,public;
\q
Let’s open the firewall port 5432
# as root
firewall-cmd --add-port 5432/tcp --permanent
systemctl restart firewalld
Before cloning the server (we need 3 servers) I will install the packages repmgr and pgpool. After that we can clone the server (it is a VM !), set-up the streaming replication with repmgr and then configure pgpool
1.2 repmgr
Repmgr is a nice open-source tool made by 2ndQuadrant. It is not mandatory to use it, and at first it may be worth setting up the streaming replication using standard postgres commands, for learning purposes. But otherwise it is really worth using repmgr because it brings very well tested and documented scripts. Note that repmgr can also be used to automate the failover (via the repmgrd daemon process) but we don't want that because we will use pgpool.
MAJORVER=11
REPMGRVER=4.2
curl https://dl.2ndquadrant.com/default/release/get/${MAJORVER}/rpm | bash
yum install -y --enablerepo=2ndquadrant-dl-default-release-pg${MAJORVER} --disablerepo=pgdg${MAJORVER} repmgr${MAJORVER}-${REPMGRVER}
mkdir /var/log/repmgr && chown postgres:postgres /var/log/repmgr
set ownership of /etc/repmgr to postgres
chown -R postgres:postgres /etc/repmgr
1.3 install pgpool
This is just about installing the rpm, the configuration will be done later. The configuration has to be done on each server but we can already install pgpool before cloning the server. Adding pgpool on Centos is easy: install the pgpool release rpm (it will create a repo file in /etc/yum.repos.d) and then install pgpool itself via yum.
export PGPOOLMAJOR=4.0
export PGPOOLVER=4.0.3
export PGVER=11
yum install -y http://www.pgpool.net/yum/rpms/${PGPOOLMAJOR}/redhat/rhel-7-x86_64/pgpool-II-release-${PGPOOLMAJOR}-1.noarch.rpm
yum install --disablerepo=pgdg11 --enablerepo=pgpool40 -y pgpool-II-pg11-${PGPOOLVER} pgpool-II-pg11-extensions-${PGPOOLVER} pgpool-II-pg11-debuginfo-${PGPOOLVER}
I prefer to have pgpool running as user postgres, so I will override the systemd unit file that was installed by the rpm
# as user root
mkdir /etc/systemd/system/pgpool.service.d
cat <<EOF > /etc/systemd/system/pgpool.service.d/override.conf
[Service]
User=postgres
Group=postgres
EOF
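systemd only picks up the override after a daemon reload (the service itself will be enabled and started later):
sudo systemctl daemon-reload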
Since pgpool connects to the various postgres servers via ssh, I add a bit of configuration to the ssh client to make ssh connections easier
cat <<EOF > /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
EOF
Since I will use the postgres user to start pgpool, I change the ownership of /etc/pgpool-II
sudo chown postgres:postgres -R /etc/pgpool-II
I also need to create a directory for the pid file and the socket file
sudo mkdir /var/run/pgpool
sudo chown postgres:postgres /var/run/pgpool
1.4 Clone the servers
Now that I have pg01 ready, I will clone it to pg02 and pg03
I am using virsh (KVM) to manage my guests, those are the commands I used to clone my VM (the VM was created with this procedure http://saule1508.github.io/libvirt-centos-cloud-image)
Note: this is mostly for my own record, your way of cloning a VM might be different of course or you might have physical servers or you might use ansible to automate the provisioning,…
# on my host
virsh shutdown pg01
sudo mkdir /u01/virt/{pg02,pg03}
sudo chown pierre:pierre /u01/virt/pg02 /u01/virt/pg03
virt-clone -o pg01 -n pg02 --file /u01/virt/pg02/pg02.qcow2 --file /u01/virt/pg02/pg02-disk1.qcow2
virt-clone -o pg01 -n pg03 --file /u01/virt/pg03/pg03.qcow2 --file /u01/virt/pg03/pg03-disk1.qcow2
Now start the VM, get into it and change its IP address. To have a DNS, I also add an entry in /etc/hosts of the KVM host
# /etc/hosts of the KVM server, add one entry per vm so that they can speak to each other via dns
# added
192.168.122.10 pg01.localnet
192.168.122.11 pg02.localnet
192.168.122.12 pg03.localnet
when doing so, one must stop and start the default network then restart libvirtd
virsh net-destroy default
virsh net-start default
sudo systemctl restart libvirtd
Connect to each VM via ssh, change the IP address and the hostname (hostnamectl set-hostname)
2. Set-up streaming replication with repmgr
Now I have 3 centos 7 VMs with postgres, repmgr and pgpool installed. The 3 VMs can resolve each other's names via DNS, but alternatively we could put 3 entries in the /etc/hosts file of each VM
Host | IP |
---|---|
pg01.localnet | 192.168.122.10 |
pg02.localnet | 192.168.122.11 |
pg03.localnet | 192.168.122.12 |
We will use repmgr to set up the 3 postgres instances in a primary - standby configuration where pg01 will be the primary. pg01 will stream its write ahead logs (wal) to pg02 and pg03.
First we want to set up ssh keys so that each server can connect to the others with the user postgres. On each server, generate an ssh key pair for user postgres and transfer the public key to both other servers.
# on pg01 as user postgres. Keep the default (no passphrase)
ssh-keygen -t rsa
ssh-copy-id postgres@pg01.localnet
ssh-copy-id postgres@pg02.localnet
ssh-copy-id postgres@pg03.localnet
From pg01 the postgres user must be able to connect to pg02 and pg03 but also to pg01 (itself) via ssh without a password. This is because pgpool - running on those servers - will connect via ssh to any of the postgres servers to perform some operations (standby promote, standby follow, node recovery)
do the same on server pg02 and on server pg03
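A quick way to verify the result from each of the 3 servers (the loop is just a convenience; no password or host-key prompt should appear):
# as user postgres, on pg01, pg02 and pg03
for h in pg01.localnet pg02.localnet pg03.localnet ; do
  ssh -o BatchMode=yes postgres@$h hostname
done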
2.1 primary database on pg01
For the streaming replication we will be using the user repmgr with password rep123. Let’s create it.
create the user repmgr
# on pg01 as user postgres
psql <<-EOF
create user repmgr with superuser login password 'rep123' ;
alter user repmgr set search_path to repmgr,"\$user",public;
\q
EOF
create a database called repmgr
# on pg01 as user postgres
psql --command "create database repmgr with owner=repmgr ENCODING='UTF8' LC_COLLATE='en_US.UTF8';"
We want to connect as repmgr without a password; that's what the .pgpass hidden file is for.
# on pg01 as user postgres
echo "*:*:repmgr:repmgr:rep123" > /home/postgres/.pgpass
echo "*:*:replication:repmgr:rep123" >> /home/postgres/.pgpass
chmod 600 /home/postgres/.pgpass
scp /home/postgres/.pgpass pg02.localnet:/home/postgres/.pgpass
scp /home/postgres/.pgpass pg03.localnet:/home/postgres/.pgpass
Add entries in $PGDATA/pg_hba.conf for repmgr
# on pg01 as user postgres
cat <<EOF >> $PGDATA/pg_hba.conf
# replication manager
local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 0.0.0.0/0 md5
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 0.0.0.0/0 md5
host all all 0.0.0.0/0 md5
EOF
sudo systemctl restart postgresql
create the repmgr configuration file
# on pg01 as user postgres
cat <<EOF > /etc/repmgr/11/repmgr.conf
node_id=1
node_name=pg01.localnet
conninfo='host=pg01.localnet dbname=repmgr user=repmgr password=rep123 connect_timeout=2'
data_directory='/u01/pg11/data'
use_replication_slots=yes
# event_notification_command='/opt/postgres/scripts/repmgrd_event.sh %n "%e" %s "%t" "%d" %p %c %a'
reconnect_attempts=10
reconnect_interval=1
restore_command = 'cp /u02/archive/%f %p'
log_facility=STDERR
failover=manual
monitor_interval_secs=5
pg_bindir='/usr/pgsql-11/bin'
service_start_command = 'sudo systemctl start postgresql'
service_stop_command = 'sudo systemctl stop postgresql'
service_restart_command = 'sudo systemctl restart postgresql'
service_reload_command = 'pg_ctl reload'
promote_command='repmgr -f /etc/repmgr/11/repmgr.conf standby promote'
follow_command='repmgr -f /etc/repmgr/11/repmgr.conf standby follow -W --upstream-node-id=%n'
EOF
Now register this database as being the primary database
# on pg01 as user postgres
repmgr -f /etc/repmgr/11/repmgr.conf -v master register
repmgr maintains information about the cluster topology in a table called nodes
psql -U repmgr repmgr -c "select * from nodes;"
We can now enable the systemd unit postgresql, so that postgres will be started on next boot
# on pg01
sudo systemctl enable postgresql
2.2. Standby database on pg02
Now we can go to the first standby, pg02, and set it up as a standby
# on pg02 as user postgres
cat <<EOF > /etc/repmgr/11/repmgr.conf
node_id=2
node_name=pg02.localnet
conninfo='host=pg02.localnet dbname=repmgr user=repmgr password=rep123 connect_timeout=2'
data_directory='/u01/pg11/data'
use_replication_slots=yes
# event_notification_command='/opt/postgres/scripts/repmgrd_event.sh %n "%e" %s "%t" "%d" %p %c %a'
reconnect_attempts=10
reconnect_interval=1
restore_command = 'cp /u02/archive/%f %p'
log_facility=STDERR
failover=manual
monitor_interval_secs=5
pg_bindir='/usr/pgsql-11/bin'
service_start_command = 'sudo systemctl start postgresql'
service_stop_command = 'sudo systemctl stop postgresql'
service_restart_command = 'sudo systemctl restart postgresql'
service_reload_command = 'pg_ctl reload'
promote_command='repmgr -f /etc/repmgr/11/repmgr.conf standby promote'
follow_command='repmgr -f /etc/repmgr/11/repmgr.conf standby follow -W --upstream-node-id=%n'
EOF
let’s wipe out the database and the archived wal
# make sure postgres is not running !
sudo systemctl stop postgresql
# on pg02 as user postgres
rm -rf /u01/pg11/data/*
rm -rf /u02/archive/*
And now let’s set-up this server as a standby
repmgr -h pg01.localnet -U repmgr -d repmgr -D /u01/pg11/data -f /etc/repmgr/11/repmgr.conf standby clone
sudo systemctl start postgresql
repmgr -f /etc/repmgr/11/repmgr.conf standby register --force
if it does not work, check that you can connect to pg01.localnet (maybe you forgot the firewall ?): psql -h pg01.localnet -U repmgr repmgr should work
we can enable the unit postgresql so that postgres will start automatically on next reboot
sudo systemctl enable postgresql
The nodes table should now have two records
psql -U repmgr -c "select * from nodes;"
2.3. Standby database on pg03
Let us do the same on the third server, pg03, and set it up as a standby
# on pg03 as user postgres
cat <<EOF > /etc/repmgr/11/repmgr.conf
node_id=3
node_name=pg03.localnet
conninfo='host=pg03.localnet dbname=repmgr user=repmgr password=rep123 connect_timeout=2'
data_directory='/u01/pg11/data'
use_replication_slots=yes
# event_notification_command='/opt/postgres/scripts/repmgrd_event.sh %n "%e" %s "%t" "%d" %p %c %a'
reconnect_attempts=10
reconnect_interval=1
restore_command = 'cp /u02/archive/%f %p'
log_facility=STDERR
failover=manual
monitor_interval_secs=5
pg_bindir='/usr/pgsql-11/bin'
service_start_command = 'sudo systemctl start postgresql'
service_stop_command = 'sudo systemctl stop postgresql'
service_restart_command = 'sudo systemctl restart postgresql'
service_reload_command = 'pg_ctl reload'
promote_command='repmgr -f /etc/repmgr/11/repmgr.conf standby promote'
follow_command='repmgr -f /etc/repmgr/11/repmgr.conf standby follow -W --upstream-node-id=%n'
EOF
let’s wipe out the database and the archived wal
# make sure postgres is not running !
sudo systemctl stop postgresql
# on pg03 as user postgres
rm -rf /u01/pg11/data/*
rm -rf /u02/archive/*
And now let’s set-up this server as a standby
repmgr -h pg01.localnet -U repmgr -d repmgr -D /u01/pg11/data -f /etc/repmgr/11/repmgr.conf standby clone
sudo systemctl start postgresql
repmgr -f /etc/repmgr/11/repmgr.conf standby register --force
we can enable the unit postgresql so that postgres will start automatically on next reboot
sudo systemctl enable postgresql
Check the repmgr tables nodes
psql -U repmgr repmgr -c "select * from nodes;"
2.4. Test streaming replication
# on pg01 as user postgres
# check the nodes table, it contains meta-data for repmgr
psql -U repmgr repmgr -c "select * from nodes;"
# create a test table
psql -U repmgr repmgr -c "create table test(c1 int, message varchar(120)); insert into test values(1,'this is a test');"
# check on pg02 if I can see the table and if it is read-only
psql -h pg02.localnet -U repmgr repmgr <<EOF
select * from test;
drop table test;
EOF
# it will say: ERROR: cannot execute DROP TABLE in a read-only transaction
# check on pg03 if I can see the table and if it is read-only
psql -h pg03.localnet -U repmgr repmgr <<EOF
select * from test;
drop table test;
EOF
# drop the table on pg01
psql -U repmgr repmgr -c "drop table test;"
Experiment a bit with basic operations: failover, switch-over, following a new master, rejoining a failed primary, etc. repmgr has commands for all of that.
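For example (to be checked against the repmgr documentation of your version), a controlled switch-over and a view of the current topology look like this:
# on the standby that should become the new primary
repmgr -f /etc/repmgr/11/repmgr.conf standby switchover --siblings-follow
# on any node
repmgr -f /etc/repmgr/11/repmgr.conf cluster show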
3. pgpool
Now it is time to configure pgpool on our 3 servers. pgpool can be used as a connection cache but - more importantly - it is used to automate the failover of postgres and to make the failover (almost) transparent to client applications.
Remember that postgres is made HA (High Available) via the streaming replication and that pgpool is made HA via a Virtual IP (VIP) managed by the watchdog mode.
The following line must be added to the postgres config (not sure why and if it is still needed)
echo "pgpool.pg_ctl='/usr/pgsql-11/bin/pg_ctl'" >> $PGDATA/conf.d/custom.conf
3.1. pool_hba and pool_passwd files
Similar to the postgres host based authentication mechanism (pg_hba.conf), pgpool has a pool_hba.conf file.
cat <<EOF > /etc/pgpool-II/pool_hba.conf
local all all trust
# IPv4 local connections:
host all all 0.0.0.0/0 md5
EOF
scp /etc/pgpool-II/pool_hba.conf postgres@pg02.localnet:/etc/pgpool-II/
scp /etc/pgpool-II/pool_hba.conf postgres@pg03.localnet:/etc/pgpool-II/
Because I use md5, I will need to have the file pool_passwd containing the md5 hashed password http://www.pgpool.net/docs/latest/en/html/runtime-config-connection.html#GUC-POOL-PASSWD
# first dump the info into a temp file
psql -c "select rolname,rolpassword from pg_authid;" > /tmp/users.tmp
touch /etc/pgpool-II/pool_passwd
# then go through the file to remove/add the entry in pool_passwd file
cat /tmp/users.tmp | awk 'BEGIN {FS="|"}{print $1" "$2}' | grep md5 | while read f1 f2
do
echo "setting passwd of $f1 in /etc/pgpool-II/pool_passwd"
# delete the line if it exists
sed -i -e "/^${f1}:/d" /etc/pgpool-II/pool_passwd
echo $f1:$f2 >> /etc/pgpool-II/pool_passwd
done
scp /etc/pgpool-II/pool_passwd pg02:/etc/pgpool-II/pool_passwd
scp /etc/pgpool-II/pool_passwd pg03:/etc/pgpool-II/pool_passwd
⚠️ Every time a new postgres user is added or its password is changed, this small procedure must be re-run
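As an alternative (a variant not used in this post), pg_md5 shipped with pgpool can update a single entry of pool_passwd directly, which is handy when only one user changed; it assumes the default /etc/pgpool-II location, and the file must still be copied to the other nodes afterwards.
# as user postgres on pg01
pg_md5 -m -u cl_user 'cl_user'
scp /etc/pgpool-II/pool_passwd pg02:/etc/pgpool-II/pool_passwd
scp /etc/pgpool-II/pool_passwd pg03:/etc/pgpool-II/pool_passwd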
3.2. Setting up pcp.conf file
pcp (pgpool control protocol) is an administrative interface to pgpool, it enables you to interact with pgpool via port 9898 (default). The file pcp.conf stores a username and an md5 password used to authenticate; the file .pcppass enables a user to run pcp commands without a password. Note that this user/password is not related to a postgres user, it is just a pcp user that can speak with pgpool via the pcp protocol. http://www.pgpool.net/docs/latest/en/html/configuring-pcp-conf.html
I will use the user postgres with password secret
# user postgres on pg01
echo "postgres:$(pg_md5 secret)" >> /etc/pgpool-II/pcp.conf
scp /etc/pgpool-II/pcp.conf pg02:/etc/pgpool-II/pcp.conf
scp /etc/pgpool-II/pcp.conf pg03:/etc/pgpool-II/pcp.conf
echo "*:*:postgres:secret" > /home/postgres/.pcppass
chown postgres:postgres /home/postgres/.pcppass
chmod 600 /home/postgres/.pcppass
scp /home/postgres/.pcppass pg02:/home/postgres/.pcppass
scp /home/postgres/.pcppass pg03:/home/postgres/.pcppass
because of this .pcppass we will be able to use pcp without password when logged in as unix user postgres.
In watchdog mode, pgpool will need to execute the ip and the arping commands as user postgres, so we set the setuid bit on those two utilities.
chmod 4755 /usr/sbin/ip /usr/sbin/arping
3.3. Pgpool configuration file: pgpool.conf
The 3 servers used in the config are pg01.localnet, pg02.localnet and pg03.localnet. We also need a VIP (virtual IP, also called delegate_ip in the context of pgpool). In my case the VIP will be 192.168.122.99.
This is the config I use for the first server, I will explain it below. Take care that because I use a cat redirected to the config file, I have to escape the dollar signs. Look at if_up_cmd below for example: I escaped the $ but in the resulting file it must of course not be escaped.
# as user postgres on pg01
CONFIG_FILE=/etc/pgpool-II/pgpool.conf
cat <<EOF > $CONFIG_FILE
listen_addresses = '*'
port = 9999
socket_dir = '/var/run/pgpool'
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/var/run/pgpool'
listen_backlog_multiplier = 2
serialize_accept = off
enable_pool_hba = on
pool_passwd = 'pool_passwd'
authentication_timeout = 60
ssl = off
num_init_children = 100
max_pool = 5
# - Life time -
child_life_time = 300
child_max_connections = 0
connection_life_time = 600
client_idle_limit = 0
log_destination='stderr'
debug_level = 0
pid_file_name = '/var/run/pgpool/pgpool.pid'
logdir = '/tmp'
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
#reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'
replication_mode = off
load_balance_mode = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
backend_hostname0 = 'pg01.localnet'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/u01/pg11/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'pg02.localnet'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/u01/pg11/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_hostname2 = 'pg03.localnet'
backend_port2 = 5432
backend_weight2 = 1
backend_data_directory2 = '/u01/pg11/data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
# this is about checking the postgres streaming replication
sr_check_period = 10
sr_check_user = 'repmgr'
sr_check_password = 'rep123'
sr_check_database = 'repmgr'
delay_threshold = 10000000
# this is about automatic failover
failover_command = '/opt/pgpool/scripts/failover.sh %d %h %P %m %H %R'
# not used, just echo something
failback_command = 'echo failback %d %h %p %D %m %H %M %P'
failover_on_backend_error = 'off'
search_primary_node_timeout = 300
# Mandatory in a 3 nodes set-up
follow_master_command = '/opt/pgpool/scripts/follow_master.sh %d %h %m %p %H %M %P'
# grace period before triggering a failover
health_check_period = 40
health_check_timeout = 10
health_check_user = 'hcuser'
health_check_password = 'hcuser'
health_check_database = 'postgres'
health_check_max_retries = 3
health_check_retry_delay = 1
connect_timeout = 10000
#------------------------------------------------------------------------------
# ONLINE RECOVERY
#------------------------------------------------------------------------------
recovery_user = 'repmgr'
recovery_password = 'rep123'
recovery_1st_stage_command = 'pgpool_recovery.sh'
recovery_2nd_stage_command = 'echo recovery_2nd_stage_command'
recovery_timeout = 90
client_idle_limit_in_recovery = 0
#------------------------------------------------------------------------------
# WATCHDOG
#------------------------------------------------------------------------------
use_watchdog = on
# trusted_servers = 'www.google.com,pg02.localnet,pg03.localnet' (not needed with a 3 nodes cluster ?)
ping_path = '/bin'
wd_hostname = 'pg01.localnet'
wd_port = 9000
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/var/run/pgpool'
delegate_IP = '192.168.122.99'
if_cmd_path = '/opt/pgpool/scripts'
if_up_cmd = 'ip_w.sh addr add \$_IP_\$/24 dev eth0 label eth0:0'
if_down_cmd = 'ip_w.sh addr del \$_IP_\$/24 dev eth0'
arping_path = '/opt/pgpool/scripts'
arping_cmd = 'arping_w.sh -U \$_IP_\$ -I eth0 -w 1'
# - heartbeat -
heartbeat_destination0 = 'pg01.localnet'
heartbeat_destination_port0 = 9694
heartbeat_destination1 = 'pg02.localnet'
heartbeat_destination_port1 = 9694
heartbeat_destination2 = 'pg03.localnet'
heartbeat_destination_port2 = 9694
other_pgpool_hostname0 = 'pg02.localnet'
other_pgpool_port0 = 9999
other_wd_port0 = 9000
other_pgpool_hostname1 = 'pg03.localnet'
other_pgpool_port1 = 9999
other_wd_port1 = 9000
EOF
Let's copy the file to pg02.localnet and pg03.localnet and make the necessary adaptations in the watchdog sections
scp /etc/pgpool-II/pgpool.conf pg02.localnet:/etc/pgpool-II/pgpool.conf
on pg02.localnet, change
wd_hostname = 'pg02.localnet'
other_pgpool_hostname0 = 'pg01.localnet'
other_pgpool_port0 = 9999
other_wd_port0 = 9000
other_pgpool_hostname1 = 'pg03.localnet'
other_pgpool_port1 = 9999
other_wd_port1 = 9000
And the same to pg03.localnet
⚠️ when there is an issue with the watchdog of pgpool the reason is very often that the config is not correct, for example one forgot to adapt the other_pgpool_hostname section on each node.
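A quick sanity check (just a convenience command) is to compare the watchdog related settings on the 3 nodes:
# run on each node and compare the output
grep -E "^(wd_hostname|other_pgpool_hostname|other_pgpool_port|other_wd_port|heartbeat_destination)" /etc/pgpool-II/pgpool.conf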
3.4. Some pgpool parameters explained:
- params related to streaming replication.
I prefer to set load_balance_mode to 'off' but it might be a useful feature in some cases. By setting master_slave_mode to on and the sub mode to stream we are telling pgpool that we use the standard postgres streaming replication mechanism.
load_balance_mode = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
The backend sections describe the 3 postgres instances
backend_hostname0 = 'pg01.localnet'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/u01/pg11/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
etc... for pg02 and pg03
- num_init_children and max_pool relate to the connection caching functionality of pgpool.
When pgpool starts it will create "num_init_children" child processes that will all be in a state "waiting for a connection". When a client application connects to pgpool, pgpool will use one of the child processes that is free and establish a connection to postgres. When the client disconnects, pgpool will keep the connection to postgres open (connection caching) so that the next time a client connects to pgpool the connection to postgres can be re-used and we avoid the cost of creating a new connection. We can disable this connection caching if we prefer to have pooling at the application level.
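One thing to keep in mind (my own rule of thumb, derived from the pgpool documentation): pgpool can open up to num_init_children * max_pool connections to each backend, so max_connections on postgres must be dimensioned accordingly.
# with the values used in this post:
#   num_init_children = 100, max_pool = 5
#   => up to 100 * 5 = 500 connections per postgres backend
# postgres max_connections (default 100) should be raised to at least
# num_init_children * max_pool + superuser_reserved_connections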
- parameters related to automatic failover
failover_command is the script that will be executed when pgpool detects (via the health check mechanism in my case) that the primary database is down. I don't know what the failback_command is supposed to do, I just don't use it. The follow_master_command is executed on all standby nodes after the failover_command was executed and a new primary database has been detected. I will show the corresponding scripts later in this document.
failover_command = '/opt/pgpool/scripts/failover.sh %d %h %P %m %H %R'
failback_command = 'echo failback %d %h %p %D %m %H %M %P'
search_primary_node_timeout = 300
follow_master_command = '/opt/pgpool/scripts/follow_master.sh %d %h %m %p %H %M %P'
The first reason why people need pgpool is that it can automate the failover and make it transparent to client applications. There are basically two mechanisms to trigger a failover: either via failover_on_backend_error or via the health checks. I prefer to set failover_on_backend_error to off and to use the health check mechanism to trigger the failover. The health check mechanism means that pgpool will regularly connect to postgres; if a connection fails (either to the primary or to a standby) then, after the configured number of retries, it will trigger the failover.
failover_on_backend_error = 'off'
health_check_period = 40
health_check_timeout = 10
health_check_user = 'hcuser'
health_check_password = 'hcuser'
health_check_database = 'postgres'
health_check_max_retries = 3
health_check_retry_delay = 1
If you set failover_on_backend_error to 'on', then pgpool will trigger the failover when one of the child processes detects that postgres is terminated. In this case there is no retry. Note also that in this case, as long as no application connects to pgpool or performs a query on the database, no failover will be triggered.
Do not confuse the health check with the streaming replication check (the streaming replication check looks at the delay in replication; if the delay is too big a standby will not be used for read load balancing and will not be considered for promotion). For some strange reason pgpool uses two different users for these.
- parameters related to watch-dog
The watch-dog mechanism is a very cool functionality of pgpool: multiple instances of pgpool (3 in our case) are in an active-passive set-up. On startup the cluster elects one leader and the leader acquires the VIP (delegate_ip). The 3 nodes monitor each other and if the leader fails then a new leader will be elected and the VIP will be moved to this new leader.
Only the leader pgpool node will execute the failover, follow_master, etc. commands
The watchdog mode must be enabled via use_watchdog (‘on’)
The nodes monitor each other via the heartbeat mechanism, sending a packet on UDP port 9694 to the other nodes
heartbeat_destination0 = 'pg01.localnet'
heartbeat_destination_port0 = 9694
heartbeat_destination1 = 'pg02.localnet'
heartbeat_destination_port1 = 9694
heartbeat_destination2 = 'pg03.localnet'
heartbeat_destination_port2 = 9694
The VIP is acquired via a script that must be installed on the host.
if_cmd_path = '/opt/pgpool/scripts'
if_up_cmd = 'ip_w.sh addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = 'ip_w.sh addr del $_IP_$/24 dev eth0'
arping_path = '/opt/pgpool/scripts'
arping_cmd = 'arping_w.sh -U $_IP_$ -I eth0 -w 1'
In the commands above, pgpool will replace $_IP_$ with the value of the delegate_ip parameter
delegate_ip = 192.168.122.99
each node must be aware of the others, so on pg01
wd_hostname = 'pg01.localnet'
wd_port = 9000
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/var/run/pgpool'
other_pgpool_hostname0 = 'pg02.localnet'
other_pgpool_port0 = 9999
other_wd_port0 = 9000
other_pgpool_hostname1 = 'pg03.localnet'
other_pgpool_port1 = 9999
other_wd_port1 = 9000
- log destination
I use the default log_destination = ‘stderr’ and I look at the logs via journalctl
sudo journalctl --unit pgpool
3.5. pgpool preparation
we need to create the postgres user for the health check
# on pg01 as user postgres
psql -c "create user hcuser with login password 'hcuser';"
We need to open a few firewall ports: 9999 for pgpool, 9898 for pcp, and 9000 (tcp) and 9694 (udp) for the watchdog functionality
# on all 3 servers
# port 5432 for postgres was already opened earlier
# ports for pgpool, pcp and the watchdog
sudo firewall-cmd --add-port 9898/tcp --permanent
sudo firewall-cmd --add-port 9999/tcp --permanent
sudo firewall-cmd --add-port 9000/tcp --permanent
sudo firewall-cmd --add-port 9694/udp --permanent
sudo systemctl restart firewalld
Directories for the pid file and the logs
# on all servers
sudo mkdir -p /var/run/pgpool
sudo chown postgres:postgres /var/run/pgpool
sudo mkdir /var/log/pgpool
sudo chown postgres:postgres /var/log/pgpool
pgpool scripts will log into /var/log/pgpool.
If you also want the logs of pgpool itself in a file, then use log_destination = 'stderr,syslog', set syslog_facility to LOCAL1 and configure syslog. Add to /etc/rsyslog.conf
local1.* /var/log/pgpool/pgpool.log
then restart the daemon
sudo systemctl restart rsyslog
nb: not sure this is working though, not tested.
3.6. Pgpool Scripts
sudo mkdir -p /opt/pgpool/scripts
sudo chown postgres:postgres /opt/pgpool/scripts
- failover script
Copy and paste the following in /opt/pgpool/scripts/failover.sh
#!/bin/bash
LOGFILE=/var/log/pgpool/failover.log
if [ ! -f $LOGFILE ] ; then
> $LOGFILE
fi
# we need this, otherwise it is not set
PGVER=${PGVER:-11}
#
#failover_command = '/opt/pgpool/scripts/failover.sh %d %h %P %m %H %R'
# Executes this command at failover
# Special values:
# %d = node id
# %h = host name
# %p = port number
# %D = database cluster path
# %m = new master node id
# %H = hostname of the new master node
# %M = old master node id
# %P = old primary node id
# %R = database cluster path of the new master
#
FALLING_NODE=$1 # %d
FALLING_HOST=$2 # %h
OLD_PRIMARY_ID=$3 # %P
NEW_PRIMARY_ID=$4 # %m
NEW_PRIMARY_HOST=$5 # %H
NEW_MASTER_PGDATA=$6 # %R
(
date
echo "FALLING_NODE: $FALLING_NODE"
echo "FALLING_HOST: $FALLING_HOST"
echo "OLD_PRIMARY_ID: $OLD_PRIMARY_ID"
echo "NEW_PRIMARY_ID: $NEW_PRIMARY_ID"
echo "NEW_PRIMARY_HOST: $NEW_PRIMARY_HOST"
echo "NEW_MASTER_PGDATA: $NEW_MASTER_PGDATA"
ssh_options="ssh -p 22 -n -T -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
set -x
if [ $FALLING_NODE = $OLD_PRIMARY_ID ] ; then
$ssh_options postgres@${NEW_PRIMARY_HOST} "/usr/pgsql-${PGVER}/bin/repmgr --log-to-file -f /etc/repmgr/${PGVER}/repmgr.conf standby promote -v "
fi
) 2>&1 | tee -a $LOGFILE
exit 0
make it executable (chmod +x /opt/pgpool/scripts/failover.sh)
- follow_master
Copy the following in /opt/pgpool/scripts/follow_master.sh
#!/bin/bash
LOGFILE=/var/log/pgpool/follow_master.log
if [ ! -f $LOGFILE ] ; then
> $LOGFILE
fi
PGVER=${PGVER:-11}
echo "executing follow_master.sh at `date`" | tee -a $LOGFILE
NODEID=$1
HOSTNAME=$2
NEW_MASTER_ID=$3
PORT_NUMBER=$4
NEW_MASTER_HOST=$5
OLD_MASTER_ID=$6
OLD_PRIMARY_ID=$7
PGDATA=${PGDATA:-/u01/pg${PGVER}/data}
(
echo NODEID=${NODEID}
echo HOSTNAME=${HOSTNAME}
echo NEW_MASTER_ID=${NEW_MASTER_ID}
echo PORT_NUMBER=${PORT_NUMBER}
echo NEW_MASTER_HOST=${NEW_MASTER_HOST}
echo OLD_MASTER_ID=${OLD_MASTER_ID}
echo OLD_PRIMARY_ID=${OLD_PRIMARY_ID}
echo PGDATA=${PGDATA}
if [ $NODEID -eq $OLD_PRIMARY_ID ] ; then
echo "Do nothing as this is the failed master. We could prevent failed master to restart here, so that we can investigate the issue" | tee -a $LOGFILE
else
ssh_options="ssh -p 22 -n -T -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
set -x
# if this node is not currently standby then it might be an old master that went back up after a failover occured
# if this is the case we cannot do the follow master command on this node, we should leave it alone
in_reco=$( $ssh_options postgres@${HOSTNAME} 'psql -t -c "select pg_is_in_recovery();"' | head -1 | awk '{print $1}' )
echo "pg_is_in_recovery on $HOSTNAME is $in_reco " | tee -a $LOGFILE
if [ "a${in_reco}" != "at" ] ; then
echo "node $HOSTNAME is not in recovery, probably a degenerated master, skip it" | tee -a $LOGFILE
exit 0
fi
$ssh_options postgres@${HOSTNAME} "/usr/pgsql-${PGVER}/bin/repmgr --log-to-file -f /etc/repmgr/${PGVER}/repmgr.conf -h ${NEW_MASTER_HOST} -D ${PGDATA} -U repmgr -d repmgr standby follow -v "
# TODO: we should check if the standby follow worked or not, if not we should then do a standby clone command
echo "Sleep 10"
sleep 10
echo "Attach node ${NODEID}"
pcp_attach_node -h localhost -p 9898 -w ${NODEID}
fi
) 2>&1 | tee -a $LOGFILE
- pgpool_recovery script
the script pgpool_recovery.sh must be installed in $PGDATA, must be owned by postgres and have execute permission. This script will be called via the postgres extension pgpool_recovery, so we must first install this extension.
# on the primary db only
psql -c "create extension pgpool_recovery;" -d template1
psql -c "create extension pgpool_adm;"
pgpool_adm is another extension that you may want to install
create the file $PGDATA/pgpool_recovery.sh (in my case $PGDATA is /u01/pg11/data). I also install this script in /opt/pgpool/scripts to keep a copy.
#!/bin/bash
# This script erase an existing replica and re-base it based on
# the current primary node. Parameters are position-based and include:
#
# 1 - Path to primary database directory.
# 2 - Host name of new node.
# 3 - Path to replica database directory
#
# Be sure to set up public SSH keys and authorized_keys files.
# this script must be in PGDATA
PGVER=${PGVER:-11}
ARCHIVE_DIR=/u02/archive
LOGFILE=/var/log/pgpool/pgpool_recovery.log
if [ ! -f $LOGFILE ] ; then
touch $LOGFILE
fi
log_info(){
echo $( date +"%Y-%m-%d %H:%M:%S.%6N" ) - INFO - $1 | tee -a $LOGFILE
}
log_error(){
echo $( date +"%Y-%m-%d %H:%M:%S.%6N" ) - ERROR - $1 | tee -a $LOGFILE
}
log_info "executing pgpool_recovery at `date` on `hostname`"
PATH=$PATH:/usr/pgsql-${PGVER}/bin
if [ $# -lt 3 ]; then
echo "Create a replica PostgreSQL from the primary within pgpool."
echo
echo "Usage: $0 PRIMARY_PATH HOST_NAME COPY_PATH"
echo
exit 1
fi
# to do: is hostname -i always OK ? Find another way to extract the host, maybe from repmgr.conf ?
primary_host=$(hostname -i) # not working on SCM
replica_host=$2
replica_path=$3
log_info "primary_host: ${primary_host}"
log_info "replica_host: ${replica_host}"
log_info "replica_path: ${replica_path}"
ssh_copy="ssh -p 22 postgres@$replica_host -T -n -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
log_info "Stopping postgres on ${replica_host}"
$ssh_copy "sudo systemctl stop postgresql"
log_info sleeping 10
sleep 10
log_info "delete database directory on ${replica_host}"
$ssh_copy "rm -Rf $replica_path/* $ARCHIVE_DIR/*"
log_info "let us use repmgr on the replica host to force it to sync again"
$ssh_copy "/usr/pgsql-${PGVER}/bin/repmgr -h ${primary_host} --username=repmgr -d repmgr -D ${replica_path} -f /etc/repmgr/${PGVER}/repmgr.conf standby clone -v"
log_info "Start database on ${replica_host} "
$ssh_copy "sudo systemctl start postgresql"
log_info sleeping 20
sleep 20
log_info "Register standby database"
$ssh_copy "/usr/pgsql-${PGVER}/bin/repmgr -f /etc/repmgr/${PGVER}/repmgr.conf standby register -F -v"
Do not forget to install it in $PGDATA and to set execute permission
chmod 774 /opt/pgpool/scripts/pgpool_recovery.sh
cp /opt/pgpool/scripts/pgpool_recovery.sh $PGDATA/pgpool_recovery.sh
scp /opt/pgpool/scripts/pgpool_recovery.sh pg02.localnet:$PGDATA/
scp /opt/pgpool/scripts/pgpool_recovery.sh pg03.localnet:$PGDATA/
for the watchdog mode, we need two additional scripts
- ip scripts
Create the file /opt/pgpool/scripts/ip_w.sh
#!/bin/bash
echo "Exec ip with params $@ at `date`"
sudo /usr/sbin/ip "$@"
exit $?
Create the file /opt/pgpool/scripts/arping_w.sh
#!/bin/bash
echo "Exec arping with params $@ at `date`"
sudo /usr/sbin/arping "$@"
exit $?
Now set execute mode on all scripts and copy them over to pg02 and pg03
chmod 774 /opt/pgpool/scripts/*
scp /opt/pgpool/scripts/* pg02.localnet:/opt/pgpool/scripts/
scp /opt/pgpool/scripts/* pg03.localnet:/opt/pgpool/scripts/
4. Test the set-up
Before starting pgpool on nodes pg01.localnet, pg02.localnet and pg03.localnet let’s do final checks
- check streaming replication
# on pg01
repmgr -f /etc/repmgr/11/repmgr.conf cluster show
- enable the postgresql and pgpool services on all servers
sudo systemctl enable postgresql
sudo systemctl enable pgpool
- check firewall on all servers
sudo firewall-cmd --list-all
- check ssh connectivity between all servers
remember that pg01 will connect to itself via ssh too, and so will pg02 and pg03
start pgpool on pg01.localnet, pg02.localnet and pg03.localnet
sudo systemctl start pgpool
as soon as there is a quorum (two nodes), pgpool will elect a leader and it will acquire the VIP
Since pgpool is logging to stderr, we can see the logs with journalctl
sudo journalctl --unit pgpool -f
Have a good look at the logs, it is a bit verbose but very relevant. Once you start pgpool on pg02.localnet and pg03.localnet, you should see in the log that a master was elected and that the VIP was acquired. The VIP will be on the master node, in my case pg01.localnet (it started first)
get info about the pgpool cluster
pcp_watchdog_info -h 192.168.122.99 -p 9898 -w -v
the -w flag is to avoid password prompt, it requires to have the pcp password for postgres in /home/postgres/.pcppass (the password md5 is defined in /etc/pgpool-II/pcp.conf)
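pcp can also report the state of each backend as seen by pgpool; node id 0 corresponds to pg01 in our backend_hostname0 definition.
pcp_node_count -h 192.168.122.99 -p 9898 -w
pcp_node_info -h 192.168.122.99 -p 9898 -w 0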
# result of ip addr
[postgres@pg01 pgpool-II]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:b5:58:ec brd ff:ff:ff:ff:ff:ff
inet 192.168.122.10/24 brd 192.168.122.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.122.99/24 scope global secondary eth0:0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:feb5:58ec/64 scope link
valid_lft forever preferred_lft forever
the important info is that the IP 192.168.122.99 was set up on eth0 with the label eth0:0
4.1. failover pgpool
before testing the failover, make sure that all nodes can connect to each other via ssh without a password prompt
# on pg01, pg02 and pg03
ssh -p 22 postgres@pg01.localnet -T -n -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "repmgr -f /etc/repmgr/${PGVER}/repmgr.conf --help"
ssh -p 22 postgres@pg02.localnet -T -n -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "repmgr -f /etc/repmgr/${PGVER}/repmgr.conf --help"
ssh -p 22 postgres@pg03.localnet -T -n -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "repmgr -f /etc/repmgr/${PGVER}/repmgr.conf --help"
to test a failover of pgpool we need to stop pgpool on the master and see if a new master is elected and if the VIP is moved to the new master. Follow the logs during this operation because it gives all relevant information.
sudo systemctl stop pgpool
Check that the VIP was moved.
restart pgpool
sudo systemctl start pgpool
4.2. failover postgres
Let’s see the state of pgpool
psql -h 192.168.122.99 -p 9999 -U repmgr -c "show pool_nodes;"
let’s see also the replication stats on the primary and on the standby’s
# pg01 is the primary
psql -h pg01.localnet -p 5432 -U repmgr -c "select * from pg_stat_replication;"
psql -h pg02.localnet -p 5432 -U repmgr -c "select * from pg_stat_wal_receiver;"
psql -h pg03.localnet -p 5432 -U repmgr -c "select * from pg_stat_wal_receiver;"
since pg01.localnet is the primary, I will stop it and pgpool should promote pg02 to primary (failover script) and pg03 should now follow the new primary.
# on pg01
sudo systemctl stop postgresql
Check again pgpool status
psql -h 192.168.122.99 -p 9999 -U repmgr -c "show pool_nodes;"
check the log files failover.log and follow_master.log in /var/log/pgpool on the pgpool master node
check with select * from pg_stat_wal_receiver; that the second stand-by follows the new primary and check with select * from pg_stat_replication; that the primary streams to the standby
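Concretely, assuming pg02 was promoted, the checks of the previous paragraph become:
# pg02 is now the primary, pg03 should stream from it
psql -h pg02.localnet -p 5432 -U repmgr -c "select * from pg_stat_replication;"
psql -h pg03.localnet -p 5432 -U repmgr -c "select * from pg_stat_wal_receiver;"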
4.3. recover failed node
There are different ways to rebase the failed primary as a standby
- using repmgr
assuming the new primary is pg02.localnet, execute the following on pg01.localnet so that it rejoins the cluster as a standby. Note that it will not always work
repmgr node rejoin -f /etc/repmgr/11/repmgr.conf -d 'host=pg02.localnet dbname=repmgr user=repmgr password=rep123 connect_timeout=2'
and then re-attach the node to pgpool
pcp_attach_node -h 192.168.122.99 -p 9898 -w 0
But the safest way is to rebuild it, using pcp_recovery_node. pgpool will use the parameters recovery_user and recovery_password to connect to the database and execute the pgpool_recovery function. This function was created when we installed the pgpool_recovery extension.
In our case we had
recovery_user = 'repmgr'
recovery_password = 'rep123'
pcp_recovery_node -h 192.168.122.99 -p 9898 -w 0
and then re-attach the node
pcp_attach_node -h 192.168.122.99 -p 9898 -w 0
4.4. Automatic recovery of a failed master or standby
In my company they wanted a failed master (or a failed standby) that comes back to life to be automatically put back in the cluster as a standby. For example, assuming this scenario:
- pg01 is the primary
- the server pg01 is powered off
- pg02 becomes the primary, pg03 follows pg02
- the server pg01 is powered on
- postgres on pg01 comes back, it is in read-write mode, however because it is detached from pgpool we are protected from a split-brain
we want pg01 to be automatically recovered, become a standby of pg02 and rejoin the pgpool cluster
Another scenario is when a standby server is rebooted: pgpool will detach the standby database but when the server comes back pgpool will not re-attach it.
My solution was to schedule a script via cron. The script is the following. Take care that it must be adapted on pg02 and on pg03: the variable PGP_NODE_ID is 0 on pg01, 1 on pg02 and 2 on pg03. Adapt also the variable DELEGATE_IP (192.168.122.99 for me)
# content of /opt/postgres/scripts/recover_failed_node.sh (can be scheduled via cron)
#!/bin/bash
# this script will re-attach a failed standby database
# or recover a failed primary database
# it requires that pgpool is available and that the database on this node is running
# this script might be called when the postgres container is starting, but then it must do so
# when both pgpool and the database are running. Since the db is started with supervisor, this would
# require launching the script in the background after the start of postgres
# the script can also be started manually or via cron
# Created by argbash-init v2.6.1
# ARG_OPTIONAL_BOOLEAN([auto-recover-standby],[],[reattach a standby to pgpool if possible],[on])
# ARG_OPTIONAL_BOOLEAN([auto-recover-primary],[],[recover the degenerated master],[off])
# ARG_OPTIONAL_SINGLE([lock-timeout-minutes],[],[minutes after which a lock will be ignored (optional)],[120])
# ARG_HELP([<The general help message of my script>])
# ARGBASH_GO()
# needed because of Argbash --> m4_ignore([
### START OF CODE GENERATED BY Argbash v2.6.1 one line above ###
# Argbash is a bash code generator used to get arguments parsing right.
# Argbash is FREE SOFTWARE, see https://argbash.io for more info
die()
{
local _ret=$2
test -n "$_ret" || _ret=1
test "$_PRINT_HELP" = yes && print_help >&2
echo "$1" >&2
exit ${_ret}
}
begins_with_short_option()
{
local first_option all_short_options
all_short_options='h'
first_option="${1:0:1}"
test "$all_short_options" = "${all_short_options/$first_option/}" && return 1 || return 0
}
# THE DEFAULTS INITIALIZATION - OPTIONALS
_arg_auto_recover_standby="on"
_arg_auto_recover_primary="off"
_arg_lock_timeout_minutes="120"
print_help ()
{
printf '%s\n' "<The general help message of my script>"
printf 'Usage: %s [--(no-)auto-recover-standby] [--(no-)auto-recover-primary] [--lock-timeout-minutes <arg>] [-h|--help]\n' "$0"
printf '\t%s\n' "--auto-recover-standby,--no-auto-recover-standby: reattach a standby to pgpool if possible (on by default)"
printf '\t%s\n' "--auto-recover-primary,--no-auto-recover-primary: recover the degenerated master (off by default)"
printf '\t%s\n' "--lock-timeout-minutes: minutes after which a lock will be ignored (optional) (default: '120')"
printf '\t%s\n' "-h,--help: Prints help"
}
parse_commandline ()
{
while test $# -gt 0
do
_key="$1"
case "$_key" in
--no-auto-recover-standby|--auto-recover-standby)
_arg_auto_recover_standby="on"
test "${1:0:5}" = "--no-" && _arg_auto_recover_standby="off"
;;
--no-auto-recover-primary|--auto-recover-primary)
_arg_auto_recover_primary="on"
test "${1:0:5}" = "--no-" && _arg_auto_recover_primary="off"
;;
--lock-timeout-minutes)
test $# -lt 2 && die "Missing value for the optional argument '$_key'." 1
_arg_lock_timeout_minutes="$2"
shift
;;
--lock-timeout-minutes=*)
_arg_lock_timeout_minutes="${_key##--lock-timeout-minutes=}"
;;
-h|--help)
print_help
exit 0
;;
-h*)
print_help
exit 0
;;
*)
_PRINT_HELP=yes die "FATAL ERROR: Got an unexpected argument '$1'" 1
;;
esac
shift
done
}
parse_commandline "$@"
# OTHER STUFF GENERATED BY Argbash
### END OF CODE GENERATED BY Argbash (sortof) ### ])
# [ <-- needed because of Argbash
printf "'%s' is %s\\n" 'auto-recover-standby' "$_arg_auto_recover_standby"
printf "'%s' is %s\\n" 'auto-recover-primary' "$_arg_auto_recover_primary"
printf "'%s' is %s\\n" 'lock-timeout-minutes' "$_arg_lock_timeout_minutes"
PIDFILE=/home/postgres/recover_failed_node.pid
trap cleanup EXIT
DELEGATE_IP=192.168.122.99
### adapt on each node ###
PGP_NODE_ID=0
PGP_STATUS_WAITING=1
PGP_STATUS_UP=2
PGP_STATUS_DOWN=3
PROGNAME=$(basename $0)
LOGFILENAME="${PROGNAME%.*}.log"
LOGFILE=/var/log/postgres/${LOGFILENAME}
if [ ! -f $LOGFILE ] ; then
touch $LOGFILE
fi
log_info(){
echo $( date +"%Y-%m-%d %H:%M:%S.%6N" ) - INFO - $1 | tee -a $LOGFILE
}
log_error(){
echo $( date +"%Y-%m-%d %H:%M:%S.%6N" ) - ERROR - $1 | tee -a $LOGFILE
}
cleanup(){
# remove the pid file but only if it is mine, don't remove it if another process was running
if [ -f $PIDFILE ] ; then
MYPID=$$
STOREDPID=$(cat $PIDFILE)
if [ "${MYPID}" == "${STOREDPID}" ] ; then
rm -f $PIDFILE
fi
fi
if [ -z "$INSERTED_ID" ] ; then
return
fi
log_info "delete from recover_failed with id $INSERTED_ID"
psql -U repmgr -h $DELEGATE_IP -p 9999 repmgr -t -c "delete from recover_failed_lock where id=${INSERTED_ID};"
}
# test if there is a lock in the recover_failed_lock table
# return 0 if there is no lock, 1 if there is one
is_recovery_locked(){
# create table if not exists
psql -U repmgr -h $DELEGATE_IP -p 9999 repmgr -c "create table if not exists recover_failed_lock(id serial,ts timestamp with time zone default current_timestamp,node varchar(10) not null,message varchar(120));"
#clean-up old records
psql -U repmgr -h $DELEGATE_IP -p 9999 repmgr -c "delete from recover_failed_lock where ts < current_timestamp - INTERVAL '1 day';"
# check if there is already an operation in progress
str=$(psql -U repmgr -h $DELEGATE_IP -p 9999 repmgr -c "select ts,node from recover_failed_lock where ts > current_timestamp - INTERVAL '${_arg_lock_timeout_minutes} min';")
if [ $? -ne 0 ] ; then
log_error "psql error when selecting from recover_failed_lock table"
exit 1
fi
echo $str | grep "(0 rows)"
if [ $? -eq 0 ] ; then
return 0
fi
log_info "there is a lock record in recover_failed_lock : $str"
return 1
}
# take a lock on the recovery operation
# by inserting a record in table recover_failed_lock
# fails if there is already a recovery running (if an old record still exists)
# exit 1 : error
# return 0 : cannot acquire a lock because an operation is already in progress
# return <id> : lock acquired, id of the inserted record
lock_recovery(){
MSG=$1
# create table if not exists
psql -U repmgr -h $DELEGATE_IP -p 9999 repmgr -c "create table if not exists recover_failed_lock(id serial,ts timestamp with time zone default current_timestamp,node varchar(10) not null,message varchar(120));"
if [ $? -ne 0 ] ; then
log_error "Cannot create table recover_failed_lock table"
exit 1
fi
is_recovery_locked
if [ $? -eq 1 ] ; then
return 0
fi
str=$(psql -U repmgr -h $DELEGATE_IP -p 9999 repmgr -t -c "insert into recover_failed_lock (node,message) values ('${NODE_NAME}','${MSG}') returning id;")
if [ $? -ne 0 ] ; then
log_info "cannot insert into recover_failed_lock"
exit 1
fi
INSERTED_ID=$(echo $str | awk '{print $1}')
log_info "inserted lock in recover_failed_log with id $INSERTED_ID"
return $INSERTED_ID
}
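# echo t if postgres on this node is in recovery (i.e. it is a standby), f otherwise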
pg_is_in_recovery(){
psql -t -c "select pg_is_in_recovery();" | head -1 | awk '{print $1}'
}
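# check that this node is a standby streaming from the given primary
# arg 1: hostname of the primary
# return 1 if the node is in recovery and streaming from the primary, 0 otherwise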
check_is_streaming_from(){
PRIMARY=$1
# first check that pg_is_in_recovery returns t (i.e. this node is a standby)
in_reco=$( pg_is_in_recovery )
if [ "a${in_reco}" != "at" ] ; then
return 0
fi
psql -t -c "select * from pg_stat_wal_receiver;" > /tmp/stat_wal_receiver.tmp
# check that the wal receiver status is streaming
status=$( cat /tmp/stat_wal_receiver.tmp | head -1 | cut -f2 -d"|" | sed -e "s/ //g" )
if [ "a${status}" != "astreaming" ] ; then
log_info "status is not streaming"
return 0
fi
# check that it is streaming from the given primary
conninfo=$( cat /tmp/stat_wal_receiver.tmp | head -1 | cut -f14 -d"|" )
echo $conninfo | grep "host=${PRIMARY}"
if [ $? -eq 1 ] ; then
log_info "not streaming from $PRIMARY"
return 0
fi
return 1
}
# arg: 1 message
recover_failed_master(){
# try to acquire a lock
MSG=$1
lock_recovery "$MSG"
LOCK_ID=$?
if [ ${LOCK_ID} -eq 0 ] ; then
log_info "cannot acquire a lock, probably an old operation is in progress ?"
return 99
fi
log_info "acquired lock $LOCK_ID"
#echo "First try node rejoin"
#echo "todo"
log_info "Do pcp_recovery_node of $PGP_NODE_ID"
pcp_recovery_node -h $DELEGATE_IP -p 9898 -w $PGP_NODE_ID
ret=$?
cleanup
return $ret
}
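# re-attach this node to pgpool with pcp_attach_node
# called when the node is a healthy standby streaming from the primary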
recover_standby(){
# don't do it if there is a lock on recover_failed_lock
is_recovery_locked
if [ $? -eq 1 ] ; then
return 0
fi
if [ "$_arg_auto_recover_standby" == "on" ] ; then
log_info "attach node back since it is in recovery streaming from $PRIMARY_NODE_ID"
pcp_attach_node -h $DELEGATE_IP -p 9898 -w ${PGP_NODE_ID}
if [ $? -eq 0 ] ; then
log_info "OK attached node $node back since it is in recovery and streaming from $PRIMARY_NODE_ID"
exit 0
fi
log_error "attach node failed for node $node"
exit 1
else
log_info "auto_recover_standby is off, do nothing"
exit 0
fi
}
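# main part of the script: make sure only one instance runs at a time (PID file),
# then check the status of this node in pgpool and decide what to do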
if [ -f $PIDFILE ] ; then
PID=$(cat $PIDFILE)
ps -p $PID > /dev/null 2>&1
if [ $? -eq 0 ] ; then
log_info "script already running with PID $PID"
exit 0
else
log_info "PID file is there but script is not running, clean-up $PIDFILE"
rm -f $PIDFILE
fi
fi
echo $$ > $PIDFILE
if [ $? -ne 0 ] ; then
log_error "Could not create PID file"
exit 1
fi
log_info "Create $PIDFILE with value $$"
str=$( pcp_node_info -h $DELEGATE_IP -p 9898 -w $PGP_NODE_ID )
if [ $? -ne 0 ] ; then
log_error "pgpool cannot be accessed"
rm -f $PIDFILE
exit 1
fi
read node port status weight status_name role date_status time_status <<< $str
if [ $status -ne $PGP_STATUS_DOWN ] ; then
log_info "pgpool status for node $node is $status_name and role $role, nothing to do"
rm -f $PIDFILE
exit 0
fi
log_info "Node $node is down (role is $role)"
# status down, the node is detached
# get the primary from pool_nodes
psql -h $DELEGATE_IP -p 9999 -U repmgr -c "show pool_nodes;" > /tmp/pool_nodes.log
if [ $? -ne 0 ] ; then
log_error "cannot connect to postgres via pgpool"
rm -f $PIDFILE
exit 1
fi
PRIMARY_NODE_ID=$( cat /tmp/pool_nodes.log | grep primary | grep -v down | cut -f1 -d"|" | sed -e "s/ //g")
PRIMARY_HOST=$( cat /tmp/pool_nodes.log | grep primary | grep -v down | cut -f2 -d"|" | sed -e "s/ //g")
log_info "Primary node is $PRIMARY_HOST"
# check if this node is a failed master (degenerated master)
# if yes then pcp_recovery_node or node rejoin is needed
if [ $role == "primary" ] ; then
# this should never happen !!
log_info "This node is a primary and it is down: recovery needed"
# sanity check
if [ $PRIMARY_NODE_ID -ne $PGP_NODE_ID ] ; then
log_error "Unpextected state, this node $PGP_NODE_ID is a primary according to pcp_node_info but pool_nodes said $PRIMARY_NODE_ID is master"
rm -f $PIDFILE
exit 1
fi
if [ "$_arg_auto_recover_primary" == "on" ] ; then
recover_failed_master "primary node reported as down in pgpool"
ret=$?
rm -f $PIDFILE
exit $ret
else
log_info "auto_recover_primary is off, do nothing"
rm -f $PIDFILE
exit 0
fi
fi
log_info "This node is a standby and it is down: check if it can be re-attached"
log_info "Check if the DB is running, if not do not start it but exit with error"
pg_ctl status
if [ $? -ne 0 ] ; then
log_error "the DB is not running"
rm -f $PIDFILE
exit 1
fi
check_is_streaming_from $PRIMARY_HOST
res=$?
if [ $res -eq 1 ] ; then
recover_standby
ret=$?
rm -f $PIDFILE
exit $ret
fi
if [ "$_arg_auto_recover_primary" == "on" ] ; then
log_info "node is standby in pgpool but it is not streaming from the primary, probably a degenerated master. Lets do pcp_recovery_node"
recover_failed_master "standby node not streaming from the primary"
ret=$?
rm -f $PIDFILE
exit $ret
else
log_info "node is supposed to be a standby but it is not streaming from the primary, however auto_recovery_primary is off so do nothing"
fi
rm -f $PIDFILE
exit 0
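Since the script exits immediately when pgpool reports this node as up, an easy way to automate it is to run it from cron on each node. Below is a minimal sketch, assuming the script was saved as /home/postgres/recover_failed_node.sh (the name, schedule and options are just an example, adapt them to your set-up):
# crontab of the postgres user: every 2 minutes, try to re-attach this node if it is a healthy standby,
# but never auto-recover a failed primary
*/2 * * * * /home/postgres/recover_failed_node.sh --auto-recover-standby --no-auto-recover-primary >/dev/null 2>&1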
pgpool graphical interface
I developed a web-based application on top of pgpool and postgres in order to visualize the cluster (state of the watchdog and state of the replication). For now the easiest way to use it is to clone the repo from github, build a docker image containing the app and then run that image either on one of the nodes or on another server (it needs access to the postgres servers via ssh).
I will go through the process to build the docker image and start it on my host. Note that depending on how you installed docker (from docker-ce or from the centos repositories), you might have to use sudo to run docker commands as a non-root user. Also note that the build requires a version of docker that supports multi-stage builds, so go for docker-ce https://docs.docker.com/install/linux/docker-ce/centos/
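For reference, installing docker-ce on Centos 7 roughly boils down to the commands below; they are only a sketch, check the linked docker documentation for the authoritative steps.
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce
sudo systemctl enable docker
sudo systemctl start docker
# optional: allow your user to run docker commands without sudo
sudo usermod -aG docker $USER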
# clone the github project
git clone git@github.com:saule1508/pgcluster.git
# get into the directory containing the app
cd pgcluster/manager
# get into the build directory
cd build
# NB: the user must be in the group docker, otherwise you need to add a sudo in front of the docker commands
# in the script build.bash
./build.bash
At the end of the script a docker image should be created; the last line of the output should say something like “Successfully tagged manager:0.8.0”.
The container needs an environment variable PG_BACKEND_NODE_LIST, a comma-separated list of the backends where each entry is node-id:host:port:weight:data-directory:failover-flag (the same information as the backend definition in pgpool). It also needs the password of the repmgr user. So in this case:
PG_BACKEND_NODE_LIST: 0:pg01.localnet:5432:1:/u01/pg11/data:ALLOW_TO_FAILOVER,1:pg02.localnet:5432:1:/u01/pg11/data:ALLOW_TO_FAILOVER,2:pg03.localnet:5432:1:/u01/pg11/data:ALLOW_TO_FAILOVER
REPMGRPWD: rep123
The docker container must also be able to connect to the 3 postgres nodes via ssh without a password: to make this possible we need to copy the public key stored in the image to the 3 servers (the key must be added to the ~/.ssh/authorized_keys of the postgres user on the backends, we can use ssh-copy-id for that).
Start the container in order to copy the public key:
docker run -ti --net host --entrypoint bash manager:0.8.0
# now we are inside the container, check that you can ping the backends
ping pg01.localnet
ping pg02.localnet
ping pg03.localnet
# copy the public key
ssh-copy-id postgres@pg01.localnet
ssh-copy-id postgres@pg02.localnet
ssh-copy-id postgres@pg03.localnet
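# optionally check that the key was copied and that passwordless ssh works, e.g.
ssh postgres@pg01.localnet hostname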
# exit the container
exit
The tool executes a script called /opt/postgres/scripts/checks.sh on each backend, so we must install it on the 3 servers:
#!/bin/bash
repmgr node check | grep -v "^Node" | while read line
do
ck=$(echo $line | sed -e "s/\t//" -e "s/ /_/g" | cut -f1 -d":")
res=$(echo $line | cut -f2- -d":")
echo repmgr,$ck,$res
done
df -k ${PGDATA} | grep -v "^Filesystem" | awk '{print "disk,"$NF","$5","$2","$3}'
df -k /backup | grep -v "^Filesystem" | awk '{print "disk,"$NF","$5","$2","$3}'
df -k /archive | grep -v "^Filesystem" | awk '{print "disk,"$NF","$5","$2","$3}'
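After copying checks.sh to /opt/postgres/scripts/checks.sh on each backend, make it executable and give it a quick try as the postgres user; the commands below are only a sketch (they assume PGDATA is set in the postgres user environment and that the /backup and /archive mount points exist, as in this set-up).
# on each backend, as root
chown postgres:postgres /opt/postgres/scripts/checks.sh
chmod 750 /opt/postgres/scripts/checks.sh
# quick test: it should print one csv line per repmgr check and per filesystem
su - postgres -c /opt/postgres/scripts/checks.sh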
And now we can run the container. It exposes port 8080, so we must open this port in the firewall (docker is restarted after firewalld because restarting firewalld drops the iptables rules that docker had created):
firewall-cmd --add-port 8080/tcp --permanent
systemctl restart firewalld
systemctl restart docker
docker run -d --net host \
-v /var/run/docker.sock:/var/run/docker.sock \
-e PG_BACKEND_NODE_LIST=0:pg01.localnet:5432:1:/u01/pg11/data:ALLOW_TO_FAILOVER,1:pg02.localnet:5432:1:/u01/pg11/data:ALLOW_TO_FAILOVER,2:pg03.localnet:5432:1:/u01/pg11/data:ALLOW_TO_FAILOVER \
-e REPMGRPWD=rep123 \
-e DBHOST=192.168.122.99 \
-e SCRIPTSDIR=/opt/postgres/scripts -e SSHPORT=22 -e BACKUPDIR=/u02/backup \
--name pgpoolmanager manager:0.8.0
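To verify that the application came up, check the logs of the container and then point a browser at port 8080 of the host (the container uses the host network, so any IP of the docker host will do):
docker logs -f pgpoolmanager
# then open http://<ip-of-the-docker-host>:8080 in a browser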