Draft: erp5: Introduce mariadb replication at SlapOS level
1. Remove mariadb_update service
Instead, initialize databases and users on creation, and run the updater and apply timezone info on every (re)start. This covers the actions that mariadb_update used to handle.
In particular: before this, mariadb_update would regularly overwrite any changes to a user (e.g. a password change) made through direct interaction with mariadb. Now the configuration in SlapOS is really only an initial configuration.
This is a prerequisite to mariadb replication because mariadb_update was a) interfering with replication and b) overwriting the users replicated from a primary.
To facilitate these changes, component/mariadb now exposes a template script for the mariadbd service, with ready hooks to take actions on database creation and on database (re)start.
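The split between creation-time and (re)start-time actions can be illustrated with a minimal Python sketch. It is not the actual template script exposed by component/mariadb: the function name `start_mariadb` and the two callbacks are hypothetical, and the data directory's existence is assumed to be the marker for "database already created".

```python
import os


def start_mariadb(datadir, on_create, on_start):
    """Run on_create only the first time, and on_start on every (re)start.

    datadir, on_create and on_start are hypothetical names for this sketch.
    Returns True when the data directory was created by this call.
    """
    created = not os.path.isdir(datadir)
    if created:
        os.makedirs(datadir)
        on_create()  # e.g. initialize databases and users
    on_start()  # e.g. run the updater and load timezone info
    return created
```

Calling this twice on the same directory runs the creation hook once and the (re)start hook twice, which is the property that makes the SlapOS configuration an initial configuration only.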
2. Allow requesting a mariadb set-up to replicate another mariadb
Using parameters of the form:
'replication': {
'bootstrap-url': 'http(s)://<recent-backup-of-primary>',
'primary-url': 'mysql://<replication-user>:<password>@<ip>:<port>',
'seconds-behind-master-threshold': <integer, defaults to 0>,
}
This takes effect on mariadb database creation - when no data exists yet. That way existing data cannot be deleted by setting or changing the replication parameters after the fact.
A promise checks that the state of the running mariadb matches the requested state (replica/primary, replication source). If it does not, the database will not converge automatically once the ~/srv/mariadb directory exists: human intervention is required.
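As a rough illustration of what such a promise compares, here is a minimal Python sketch. The function name and the shape of its arguments (`is_replica`, `source`) are assumptions made for this example and are not taken from the actual promise.

```python
from urllib.parse import urlsplit


def check_replication_state(replication, is_replica, source=None):
    """Return a list of mismatches; an empty list means the promise passes.

    replication: the 'replication' instance parameters, or None if unset.
    is_replica:  whether the running mariadb is currently replicating.
    source:      (host, port) it replicates from, or None (hypothetical shape).
    """
    problems = []
    if not replication:
        if is_replica:
            problems.append("expected a primary, but mariadb is replicating")
        return problems
    if not is_replica:
        problems.append("expected a replica, but mariadb is not replicating")
        return problems
    url = urlsplit(replication["primary-url"])
    expected = (url.hostname, url.port)
    if source != expected:
        problems.append(f"replicating from {source!r} instead of {expected!r}")
    return problems
```

The sketch only reports mismatches. It deliberately takes no corrective action, in line with the fact that SlapOS controls only the initial state.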
The bootstrap-url may be omitted: this skips replication bootstrap and requires that all binlogs still be available on the primary. This is useful when the primary is recent and may not have a ready backup for bootstrap yet.
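The rules above (creation only, optional bootstrap) can be summarized in a short Python sketch. The function `plan_replica_setup` and the step names are hypothetical. Only the parameter names bootstrap-url and primary-url come from this description.

```python
from urllib.parse import urlsplit


def plan_replica_setup(replication, datadir_exists):
    """Return the ordered steps to take, as (name, detail) tuples.

    Nothing is planned when data already exists, so changing the replication
    parameters after the fact can never delete existing data.
    """
    if datadir_exists or not replication:
        return []
    steps = []
    bootstrap_url = replication.get("bootstrap-url")
    if bootstrap_url:
        # Optional: without it, the primary must still hold all its binlogs.
        steps.append(("load-backup", bootstrap_url))
    primary = urlsplit(replication["primary-url"])
    steps.append(("start-replication", {
        "user": primary.username,
        "password": primary.password,
        "host": primary.hostname,
        "port": primary.port,
    }))
    return steps
```

`urlsplit` is used here because the primary-url has the usual `scheme://user:password@host:port` layout, including bracketed IPv6 hosts.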
Finally, a mariadb replica can optionally disable TCP access:
'replication': {
# ...
'allow-tcp-connections-on-replica': True or False,
}
Add option allow-tcp-connections-on-replica, set to true by default. This option concerns only replica mariadbs: TCP connections are always enabled when replication parameters are unset, even if the database is actually in replication state in contradiction with the parameters.
This option corresponds to skip-networking in mariadb configuration; this setting is static, so when it changes the mariadb process will be automatically restarted by SlapOS to apply the new configuration.
Note: disabling TCP connections on replicas with this option currently breaks the property that takeover can be done without changing the instance parameters and reprocessing the partition: until both are done, the taken-over mariadb will still have TCP disabled and remain unusable.
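The mapping from instance parameters to the skip-networking setting can be sketched as follows. This is an illustrative Python model, not the actual configuration template; the function names are hypothetical.

```python
def skip_networking(instance_parameters):
    """Return True when skip-networking should be written to the config."""
    replication = instance_parameters.get("replication")
    if not replication:
        # TCP is always enabled when replication parameters are unset.
        return False
    # The option defaults to true, i.e. TCP stays enabled on replicas.
    return not replication.get("allow-tcp-connections-on-replica", True)


def render_mysqld_options(instance_parameters):
    """Render a minimal [mysqld] section reflecting the option."""
    lines = ["[mysqld]"]
    if skip_networking(instance_parameters):
        lines.append("skip-networking")
    return "\n".join(lines) + "\n"
```

Because the decision depends only on the instance parameters, a taken-over replica keeps skip-networking until its parameters are edited, which is the limitation described in the note above.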
TODO:
- Allow a replica mariadb to stop replicating and become a primary without requiring manual login to the instance and manual operations on the DB (e.g. by providing a url where the user can click to perform this action). This will be a necessary step of an eventual automated takeover procedure.
- Find a better solution for mariadb_update functionality. See #1.
- Make the mariadb_replication promise avoid needless partition processing (bang): currently, it will trigger a bang when the state of mariadb (replica/primary, replication source) does not match the expected state (corresponding to the parameters), even though SlapOS only controls the initial state on database creation, and reprocessing the partition will by design not make it converge to the expected state.
- For mariadb replicas requested with allow-tcp-connections-on-replica=false (which results in skip-networking being written in the config file), find a way to take over without needing to edit the instance parameters and reprocess the partition. This requires a way to restart mariadb with different options, using only the privileges of the partition. This could maybe be done by wrapping the mariadb service in a wrapper program (maybe an ad-hoc script, maybe supervisord) that allows restarting mariadb with skip-networking enabled or disabled as appropriate. Note that when allow-tcp-connections-on-replica=true, takeover requires neither editing the instance parameters nor reprocessing (which is the main reason true is the current default).
3. Automate mariadb replication bootstrapping
Make any mariadb (replica or primary) a) statically serve recent backups (dumps) on the same IP as the mariadb server and b) have a configured replication_user with a random password, and publish two corresponding connection parameters, replication-bootstrap-url and replication-primary-url, to be used to set up a replica mariadb.
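A minimal Python sketch of how these two published parameters could be assembled is shown below. The helper names, the `/` path of the backup URL and the port arguments are assumptions for illustration. Only the parameter names and the replication_user name come from this description.

```python
import secrets


def format_host(ip):
    """Wrap IPv6 literals in brackets so they can be used in a URL."""
    return "[%s]" % ip if ":" in ip else ip


def replication_connection_parameters(ip, mariadb_port, backup_port,
                                      user="replication_user"):
    """Build the two connection parameters published for replicas."""
    # URL-safe random password, so it needs no escaping in the URL.
    password = secrets.token_urlsafe(16)
    host = format_host(ip)
    return {
        # The path layout of the backup server is a placeholder.
        "replication-bootstrap-url": "http://%s:%d/" % (host, backup_port),
        "replication-primary-url": "mysql://%s:%s@%s:%d" % (
            user, password, host, mariadb_port),
    }
```

A replica can then be requested by copying these two values into the bootstrap-url and primary-url replication parameters described in section 2.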
TODO:
- Use mariabackup instead of dumps to allow fast bootstrapping of a replica. This will affect the replica-initialisation logic as well.
- Propagate these mariadb connection parameters in erp5 root instance, as mariadb-replication-primary-url and mariadb-replication-bootstrap-url.
4. Authenticate and Encrypt Primary <---> Replica communications with TLS
Feature
- Use TLS on public IPv6: a) serve the backups with TLS on IPv6 and b) proxy the mariadb server with TLS on IPv6. Another option would be to enable TLS in mariadb directly, but the proxy approach decouples mariadb user configuration from TLS configuration, makes sure all users are protected by TLS, and allows using TLS on IPv6 and not on IPv4.
Each mariadb instance (whether in primary or replica mode) now has by default a dedicated caucased server. This caucased is conceptually responsible for authenticating and encrypting access to that mariadb instance (although currently only when that access is over IPv6).
Besides the caucased itself, a caucase user certificate (user meaning admin) for this caucased is automatically issued inside the same instance. This certificate is then used by a dedicated service inside the instance to sign Certificate Signing Requests (CSR) that are passed via instance parameters.
To control whether caucased is enabled and to pass CSRs to sign, mariadb now takes parameters of the form:
"caucased": {
"enable": true or false, true by default,
"csr-to-sign": <PEM-encoded string representing one or more CSRs>
}
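Since csr-to-sign may hold one or more PEM-encoded CSRs in a single string, the signing service has to separate them first. A minimal Python sketch of that step follows. The function name is hypothetical, and this is not the code of the actual signing service. It only recognizes the standard `CERTIFICATE REQUEST` PEM label.

```python
import re

# Matches one PEM block with the standard CSR label.
CSR_PATTERN = re.compile(
    r"-----BEGIN CERTIFICATE REQUEST-----"
    r".*?"
    r"-----END CERTIFICATE REQUEST-----",
    re.DOTALL,
)


def split_csrs(pem_text):
    """Return each CSR found in pem_text as its own PEM string."""
    return [match.group(0) + "\n" for match in CSR_PATTERN.finditer(pem_text)]
```

Text between blocks (blank lines, comments) is ignored, so CSRs published by several replicas can simply be concatenated into the parameter.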
In addition, two caucase service certificates are automatically issued in the instance. Conceptually these are used to authenticate and encrypt access to the mariadb server. Ideally there should be only one, but as a first step it was easier to have two: one for authenticating access to mariadb itself and one for authenticating access to the bootstrap (HTTP) server, because the two expect different bundling and naming conventions.
The url of the caucased is published in connection parameters under replication-caucased-url. When a replica instance is requested to replicate from a mariadb that uses caucased, it should receive that caucased-url in replication parameters like this:
"replication": {
"caucased-url": <primary-caucased-url>,
[...]
}
It then requests a certificate on the primary's caucased and publishes the corresponding CSR under caucased-csr-to-sign. This can then be passed in the instance parameters of the primary to make the primary caucased "semi-automatically" approve that CSR. After that the replica will obtain and keep up-to-date a certificate that it can use to connect to the primary.
This scheme makes it possible to establish a secure encrypted connection with mutual authentication (mTLS) between two SlapOS instances, with zero knowledge in the SlapOS master. It is generic and could be reused for any instance-to-instance authentication in SlapOS. To obtain access to resources of an instance protected in this way, one has to prove they have the right to modify its instance parameters.
To enable access on IPv6 and IPv4 simultaneously, each mariadb now also has by default a proxy that listens on IPv6 and forwards connections to the mariadb server.
- If caucased is disabled, TLS is not enforced on IPv6 and this proxy is a single HAProxy that exposes both the mariadb server and the bootstrap (HTTP) server on IPv6.
- If caucased is enabled, TLS is enforced on IPv6. HAProxy is then used to proxy the bootstrap (HTTP) server and enforce TLS. ProxySQL is used to proxy the mariadb server and enforce TLS. ProxySQL is used in this case instead of HAProxy because HAProxy is not compatible with the STARTTLS-like protocol the mariadb replica will use to connect.
These proxies are controlled by instance parameters of the form:
"haproxy": {
"enable": true or false, true by default
}
Despite the name, this also affects ProxySQL.
If caucased is enabled but haproxy is disabled, the caucase service certificates are unused, and the mariadb server and the bootstrap (HTTP) server only listen on IPv4 (unless legacy parameter use-ipv6 is used, in which case they only listen on IPv6).
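The combinations described above can be summarized as a small decision table. The Python sketch below is only a model of that description: the function name and the returned dictionary keys are hypothetical. Two cases are assumptions because this description does not cover them: both caucased and haproxy disabled, and use-ipv6 combined with an enabled proxy.

```python
def ipv6_exposure(caucased_enabled=True, haproxy_enabled=True, use_ipv6=False):
    """Model which proxies run and where the servers are reachable."""
    if not haproxy_enabled:
        # Described for caucased enabled; assumed identical when disabled.
        return {
            "listen": ["IPv6"] if use_ipv6 else ["IPv4"],
            "mariadb-proxy": None,
            "bootstrap-proxy": None,
            "tls-on-ipv6": False,
        }
    # Assumption: use_ipv6 is ignored when the proxy is enabled.
    if caucased_enabled:
        return {
            "listen": ["IPv4", "IPv6"],
            "mariadb-proxy": "ProxySQL",  # handles the STARTTLS-like protocol
            "bootstrap-proxy": "HAProxy",
            "tls-on-ipv6": True,
        }
    return {
        "listen": ["IPv4", "IPv6"],
        "mariadb-proxy": "HAProxy",  # one HAProxy exposes both servers
        "bootstrap-proxy": "HAProxy",
        "tls-on-ipv6": False,
    }
```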
TODO
- Use only one service certificate on the primary side, instead of two currently.
- Instead of auto-approving the service certificates for the mariadb server via caucased's auto-approve feature, use the sign-csr service to approve all service certificates, including the ones on the same instance. This will be safer (no risk that a certificate other than the intended one is auto-approved) and more consistent.
- Decide whether ProxySQL's lack of CRL support is an issue, and find a workaround or another solution if it is.
- Deprecate use-ipv6 somehow.
- Find a more generic name than "haproxy" for the HAProxy and ProxySQL related parameters; maybe simply IPv6-proxy?
- Forward the publication of these new mariadb published parameters in erp5 root instance.
- Find better naming conventions for published parameters: replication-caucased-url is suboptimal because that caucased-url may end up being used to grant IPv6 access in other use-cases than replication. caucased-csr-to-sign does not make it clear that the CSRs need to be signed by the primary's caucased, not the caucased of the replica. Generally, caucased and replication related parameters tend to be ambiguous in whether they refer to the local mariadb or the primary when the current instance is a replica.
- Add these new instance parameters in JSON Schema.
5. Automate neo asynchronous replication
When upstream-cluster and upstream-masters are given, also pass --backup to the neo master so that it converges automatically to BACKINGUP state.
In other words, when a neo is requested with upstream- parameters, make it automatically start in BACKINGUP state without requiring manual intervention. This applies only on neo database creation.
Add a promise that asserts neo is in BACKINGUP state when upstream- parameters are set (but does not assert neo is in RUNNING state when upstream- parameters are unset, for backwards compatibility with current usage).
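The two rules above (pass --backup when both upstream parameters are given, and assert the state only in that case) can be modelled with a minimal Python sketch. The function names are hypothetical and this is not the actual software release code.

```python
def neo_master_options(parameters):
    """Extra command-line options for the neo master."""
    if parameters.get("upstream-cluster") and parameters.get("upstream-masters"):
        return ["--backup"]
    return []


def neo_state_promise(parameters, state):
    """Return True when the promise passes for the observed neo state."""
    if neo_master_options(parameters):
        return state == "BACKINGUP"
    # Nothing is asserted without upstream- parameters, for backwards
    # compatibility with current usage.
    return True
```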
TODO:
- Make the neo state promise avoid needless partition processing (bang): currently, it will trigger a bang e.g. when the state is RUNNING and the promise expects BACKINGUP, even though SlapOS only controls the initial state on database creation, and reprocessing the partition will by design not attempt to make it converge to the expected state.
6. Make zope aware of replication
Deactivate zope promises when the neo is expected to be BACKINGUP, as this makes the zope process crash, which is currently expected.
Deactivating the zope process entirely in that case is not desired because reactivating it would require updating instance parameters and reprocessing the partition. Instead, ideally, the zope service should adapt to the state of the neo.
Also, move zope service from etc/service to etc/run to make it not be "on-watch", so that when the neo is BACKINGUP and zope crashes, the partition does not bang and reprocess continuously. This seems ok because the promise already asserts the service is running.
TODO:
- Adapt the zope service so that it detects when neo is in BACKINGUP state and goes on standby until neo is RUNNING as part of normal execution of the service, instead of crashing. One envisioned way is to wrap the existing zope service in a wrapper program that will handle this additional functionality, catch zope crashing, poll neo state or otherwise be notified of neo state changes, and relaunch zope as needed. Such a program could be an ad-hoc wrapper script, or maybe a supervisord launching a zope and a kind of neo-listener service.
- Standardize operations related to creating an ERP5 clone of a production ERP5: this implies creating a replica, "detaching" it (like taking it over without stopping the original primary), selectively starting an admin zope while making sure activity zopes remain stopped, and changing all that is required to prevent the clone from interfering with the actual production ERP5 before starting the remaining zopes. One way is simply to start only selected zope partitions via SlapOS. Another way could be that zope services may be started or stopped directly: this could also be achieved via a wrapper program such as supervisord, but would require that it offer a remote interface. Or maybe the right thing to do would be to standardize a way to control network access of each partition via a firewall, so as to be able to selectively cut network access.
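The wrapper idea from the first TODO item could start from a decision function like the one below. It is purely a sketch of one possible design in Python: the function name and action names are hypothetical, and how the neo state is obtained (polling or notification) is left open.

```python
def wrapper_action(neo_state, zope_running):
    """Decide what a hypothetical zope wrapper should do next.

    neo_state:    the current neo cluster state, e.g. "RUNNING" or "BACKINGUP".
    zope_running: whether the wrapped zope process is currently alive.
    """
    if neo_state == "RUNNING":
        # Relaunch zope if it crashed or was never started.
        return "keep" if zope_running else "start"
    # In any other state (e.g. BACKINGUP), zope stays on standby.
    return "stop" if zope_running else "wait"
```

Keeping the decision separate from process handling would let the same logic be reused by either an ad-hoc script or a supervisord event listener.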
7. Miscellaneous fixes
Include some miscellaneous fixes for mariadb-with-IPv6 and gcc-version-for-Python2-SRs.