
Configuring Salt Master for High Availability

Overview

This guide describes a deterministic Salt master HA model for environments using Aria Automation and RaaS.

Design goals:

  • Keep salt-master-1 as the normal active master.
  • Keep salt-master-2 as warm standby.
  • Use Git as the source of truth for states and pillar.
  • Keep minions configured for dual-master failover.
  • Accept new minion keys only on the active primary (salt-master-1) during normal operation.
Component       Role
salt-master-1   Primary provisioning master and PKI source
salt-master-2   HA execution master and PKI replica
RaaS (SSEAPI)   Dispatches jobs to masters in the configured SSE cluster

Important

In SaltStack Config environments with multiple Salt masters in the same SSE cluster, runner jobs may execute on all masters simultaneously.

For normal state execution this behaviour is harmless because minions connect to only one master at a time.

However, provisioning operations that use the saltify driver (such as Aria Automation minion deployments) are not safe for concurrent execution.

If two masters attempt to deploy a minion at the same time, the target machine may encounter file-locking errors such as STATUS_SHARING_VIOLATION.

This results in the Aria Automation deployment being marked as failed even though the minion installation may have succeeded.

Architecture and Design Validation

This design is valid for active/passive master operation where predictable failover is preferred over load balancing.

What this design explicitly does not provide:

  • Shared job cache between masters.
  • Shared mine data between masters.

Operational implications:

  • Job history remains local to each master.
  • Mine data repopulates after failover based on configured mine intervals.
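Both implications can be observed directly. A sketch using standard Salt runner and execution functions (run the first command on each master and compare; each master returns only the jobs it dispatched and cached locally):

```shell
# Job history is local to each master: the same query on the two
# masters returns different results.
salt-run jobs.list_jobs search_function='state.apply'

# Mine data is also per-master. After a failover it repopulates on the
# configured mine interval, or immediately when forced:
salt '*' mine.update
```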

Git and Master Prerequisites

To ensure that state files, and the pillar data they reference, resolve identically, both masters must be configured the same for GitFS and pillar:

  • Same gitfs_remotes
  • Same ext_pillar Git remotes
  • Same branch model (dev, test, prod)
  • Same deploy keys/access

Validation example:

salt-run fileserver.update
salt-run fileserver.file_list saltenv=dev
salt-run fileserver.file_list saltenv=test
salt-run fileserver.file_list saltenv=prod
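The commands above only show one master's view. To confirm both masters resolve identical content, the outputs can be compared across masters. A sketch, run from salt-master-1 and assuming the SSH access between the masters that is configured later in this guide:

```shell
# Compare GitFS file lists between the two masters per environment.
# The hostname is the example name used throughout this guide.
for env in dev test prod; do
  diff \
    <(salt-run --out=json fileserver.file_list saltenv=$env) \
    <(ssh root@salt-master-2.domain.local \
        "salt-run --out=json fileserver.file_list saltenv=$env") \
    && echo "saltenv $env: identical"
done
```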

Minion Configuration (Dual Master Failover)

Existing Salt minions are not aware of the new second master and must be updated.

On each minion create /etc/salt/minion.d/ha.conf:

master:
  - salt-master-1.domain.local
  - salt-master-2.domain.local
master_type: failover
master_alive_interval: 60
master_failback: True
master_failback_interval: 30
retry_dns: 30
random_master: False

Behavior:

  • Minion prefers master 1.
  • Minion fails over to master 2 automatically.
  • Minion retries primary when available again.

State example for rollout:

/etc/salt/minion.d/ha.conf:
  file.managed:
    - user: root
    - group: root
    - mode: "0644"
    - contents: |
        master:
          - salt-master-1.domain.local
          - salt-master-2.domain.local
        master_type: failover
        master_alive_interval: 60
        master_failback: True
        master_failback_interval: 30
        retry_dns: 30
        random_master: False

restart_salt_minion:
  service.running:
    - name: salt-minion
    - enable: True
    - watch:
      - file: /etc/salt/minion.d/ha.conf

Apply:

salt "*" state.apply minion.ha_dual_master

Aria Automation Blueprint Changes

For newly provisioned machines, update the Aria Automation blueprint so Salt minions are configured for dual-master failover at deployment time.

In the Cloud.SaltStack resource, add an additionalMinionParameters section that defines the order of preference of the Salt masters and the failover mode.

additionalMinionParameters:
  master:
    - salt-master-1
    - salt-master-2
  master_type: failover
  master_alive_interval: 60
  master_failback: True
  master_failback_interval: 30
  master_shuffle: False
  random_master: False

This prevents new deployments from registering with only one master and keeps blueprint-driven provisioning aligned with the HA design.

Provisioning Behaviour with Multiple Masters

When SaltStack Config is configured with multiple masters in the same SSE cluster, runner jobs may execute on all masters.

For normal Salt operations this is not an issue because each minion connects to only one master at a time.

However, Aria Automation installs Salt minions using the saltify cloud driver.

The deployment workflow performs the following actions:

  1. Copy the Salt installer to the target system
  2. Install the Salt minion
  3. Start the service and accept the key

If two masters attempt this process simultaneously, the target host may experience file-locking errors during the file copy stage.

Typical error example:

STATUS_SHARING_VIOLATION

This condition can cause Aria Automation to report a failed deployment even though the Salt minion installation has completed successfully.

For environments integrating Aria Automation with SaltStack Config HA masters, provisioning and runtime execution should be separated.

Recommended design:

Master          Role
salt-master-1   Primary provisioning master
salt-master-2   Failover execution master

Provisioning flow:

Aria Automation -> RaaS -> salt-master-1

Runtime execution flow:

salt-minion -> connected master (failover capable)

Minions should still be configured with both masters:

master:
  - salt-master-1.domain.local
  - salt-master-2.domain.local
master_type: failover

This ensures runtime high availability while preventing provisioning race conditions.

PKI Replication Model

To ensure minions can seamlessly fail over between masters, the master identity keys must be identical on both Salt masters.

Minion keys newly accepted on master 1 must also be periodically synchronised to the second master.

Design:

  • Master 1 is the PKI source.
  • Master 2 pulls accepted minion keys from master 1.
  • New keys are accepted only on master 1 during normal operation.

Master identity files that must match on both masters:

  • /etc/salt/pki/master/master.pem
  • /etc/salt/pki/master/master.pub
  • /etc/salt/pki/master/sseapi_key.pem
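Whether these files actually match can be checked by checksum. A sketch, run on salt-master-2 and assuming the SSH access to master 1 that is configured later in this guide:

```shell
# Verify the master identity files are byte-identical on both masters.
for f in master.pem master.pub sseapi_key.pem; do
  local_sum=$(sha256sum /etc/salt/pki/master/$f | awk '{print $1}')
  remote_sum=$(ssh root@salt-master-1.domain.local \
      "sha256sum /etc/salt/pki/master/$f" | awk '{print $1}')
  [ "$local_sum" = "$remote_sum" ] && echo "$f: match" || echo "$f: MISMATCH"
done
```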

Master 1 Setup (PKI Source)

Minion on Master 1

On salt-master-1, point its local minion at the local master:

master: localhost

Then:

systemctl restart salt-minion

SSH Access for Pull Sync

Allow SSH key-based root access from salt-master-2 to salt-master-1 for PKI sync operations.

Example drop-in on master 1 (/etc/ssh/sshd_config.d/salt-ha.conf):

PermitRootLogin prohibit-password
PubkeyAuthentication yes
KbdInteractiveAuthentication no

Reload sshd:

systemctl reload sshd

Optional hardening in authorized_keys:

from="salt-master-2-ip",no-agent-forwarding,no-port-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA...

Master 2 Setup (PKI Replica)

Generate SSH key and trust to master 1:

ssh-keygen -t ed25519 -f /root/.ssh/id_ed25519 -N ""
ssh-copy-id root@salt-master-1.domain.local
ssh root@salt-master-1.domain.local hostname
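The pull state below only synchronises accepted minion keys, while the master identity files (master.pem, master.pub, sseapi_key.pem) must also be identical on both masters. A one-time replication sketch, run on salt-master-2 using the SSH access set up above:

```shell
# One-time copy of the master identity keys from master 1.
# Restart salt-master afterwards so it loads the replaced keys.
for f in master.pem master.pub sseapi_key.pem; do
  rsync -a root@salt-master-1.domain.local:/etc/salt/pki/master/$f \
        /etc/salt/pki/master/$f
done
systemctl restart salt-master
```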

Create pull state (salt://ha/pki/pull_from_master1.sls):

{% set source_master = "salt-master-1.domain.local" %}

/etc/salt/pki/master/minions:
  file.directory:
    - user: root
    - group: root
    - mode: "0700"
    - makedirs: True

sync_minion_keys:
  cmd.run:
    - name: >
        rsync -az --delete
        --no-perms --no-owner --no-group
        -e "ssh"
        root@{{ source_master }}:/etc/salt/pki/master/minions/
        /etc/salt/pki/master/minions/
    - shell: /bin/bash

Test:

salt "salt-master-2" state.apply ha.pki.pull_from_master1

Scheduled Safety Sync

On salt-master-2 create /etc/salt/minion.d/pki-sync.conf:

schedule:
  pki_sync:
    function: state.apply
    args:
      - ha.pki.pull_from_master1
    minutes: 10

Then:

systemctl restart salt-minion
salt-call --local schedule.list

Operational Procedure

Normal Operation

  • Provisioning jobs are initiated through salt-master-1. Execution jobs may run on either master depending on which master the minion is connected to.
  • Minions prefer salt-master-1.
  • Master-2 operates as warm standby.
  • PKI sync runs periodically from master-1.

Master 1 Failure

  1. Confirm salt-master-1 outage.
  2. RaaS begins dispatching to salt-master-2.
  3. Confirm minions reconnect to salt-master-2.
  4. Continue operations on the standby master.
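Minion reconnection (step 3) can be confirmed from the standby master with standard Salt commands:

```shell
# On salt-master-2: list minions currently connected and responding
salt-run manage.up

# Spot-check connectivity
salt '*' test.ping
```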

Master 1 Recovery

  1. Restore master 1 services.
  2. Validate key sync from master 1 to master 2.
  3. Validate RaaS and Aria end-to-end execution.

Validation Checklist

Pre go-live:

  • [ ] GitFS and pillar content resolve identically on both masters.
  • [ ] Master identity PEM files are identical on both masters.
  • [ ] Minions contain dual-master failover config.
  • [ ] PKI pull state succeeds manually.
  • [ ] Scheduled key sync is visible in local schedule.

Failover test:

  1. Stop salt-master on master 1.
  2. Confirm minion reconnection to master 2.
  3. Run a configuration state from Aria Automation and confirm Salt execution. Provisioning tests should only be performed when the primary master is active.
  4. Restore master 1 and validate failback.
