Salt Configuration Notes
Here are just some random notes on installing and configuring salt-master and salt-minion services. I figured I better write this down before I forget. Most of this is covered in official documentation, but in a very nonchalant manner with no practical examples. Something along the lines of “Step 1: build a particle accelerator; Step 2: tune the TCP/IP stack…” Additionally, SaltStack documentation tends to bring things to your attention out of logical sequence. To quote Robbie from The Wedding Singer, “Gee, you know that information… really would’ve been more useful to me yesterday!” But enough complaining…
Color Output
I would strongly recommend you disable the colored output on the salt-master. It may look pretty, but it adds non-printable characters that can and will screw things up for you should you try to pipe or otherwise redirect the output to a shell command or script.
# By default output is colored, to disable colored output set the color value
# to False
color: False
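If you only need plain output once in a while, you can also leave the master setting alone and strip color per invocation; the salt command accepts a --no-color flag, and salt-run should take it as well:
salt --no-color "*" test.ping
salt-run --no-color manage.up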
Network Tuning
Apparently, it would help tremendously to actually go ahead and tune your TCP/IP stack on the salt-master box before adding hundreds of minions. Here’s what’s recommended in the official documentation, chapter 5 “Troubleshooting”:
echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
Naturally, if you want these settings to survive the next reboot, you’d better add them to /etc/sysctl.conf, or wherever your particular Linux flavor keeps the scary kernel settings. Personally, I would recommend the following:
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=87380
sysctl -w net.core.wmem_default=87380
sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 87380 16777216'
sysctl -w net.ipv4.tcp_mem='16777216 16777216 16777216'
sysctl -w net.ipv4.route.flush=1
And add this to your /etc/sysctl.conf:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 87380
net.core.wmem_default = 87380
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
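Once the file is in place, you can load it (and catch any typos) without waiting for a reboot:
sysctl -p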
Minion Status
You have these three basic commands that will show you the list and status of your minions:
salt-run manage.up
salt-run manage.down
salt-run manage.status
The first shows you minions that are responding. The second shows you minions that are not responding. And the last one shows you all the known minions, regardless of whether they are responding or not. You can also check the contents of the following two directories:
ls -1 /var/cache/salt/master/minions
ls -1 /etc/salt/pki/master/minions
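The salt-key utility gives you a similar picture from the key management side, listing accepted, unaccepted and rejected minion keys in one shot:
salt-key -L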
Sometimes you may need to show the unresponsive minions and count them with a single command. Here’s how you do it:
salt-run manage.down 2>/dev/null | tee /dev/tty | wc -l
So, what if a minion is not responding, but the salt-minion service is up and running and everything looks hunky-dory? Well, one thing to keep in mind is that network performance limitations can affect your salt-master’s ability to communicate with all the minions within the default timeout window, which is only five seconds. If you have a lot of minions, you may want to increase the timeout to 10 or 15 seconds.
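Both salt and salt-run accept the timeout on the command line, so you can experiment before touching any configuration files. For example, a lightweight way to poke every minion with a longer timeout:
salt --timeout=15 "*" test.ping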
To determine the best value, first, count all your minions:
salt-run manage.status | wc -l
Now, count those that respond within 5, 10, 15 and more seconds:
for i in `seq 5 5 30`
do
  echo -ne "Timeout ${i}s:\t"
  salt-run --timeout=${i} manage.up | wc -l
done
You are looking for the minimum timeout value where the minion count is closest to what you got from the “manage.status” command. If the optimal timeout value seems too high, you will need to revisit the network performance of your salt-master. Also, keep in mind that a very busy minion will take longer to respond. So don’t be too quick to blame the issue on the salt-master.
Once you have determined an acceptable timeout value, you can put it in the salt-master configuration file, so you don’t have to specify it on the command line every time:
/bin/cp -pf /etc/salt/master /etc/salt/master_`date +'%Y-%m-%d_%H%M%S'`
echo "timeout: 10" >> /etc/salt/master
/sbin/service salt-master restart
If you have a lot of minions that routinely run at high system load (such as may be the case with HPC clusters), modifying the salt-minion startup script to launch with a higher process priority may do the trick (a quick example follows below). Finally, just remember that communicating with hundreds of diverse systems on the network is bound to result in a few errors now and then. Don’t strive for a hundred percent reliability: you’ll never get there.
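As for the process priority, one quick way to experiment without touching the startup script is to renice the already-running process. This is a rough sketch: the priority value here is arbitrary, and the pgrep pattern may need tightening on your systems:
renice -n -5 -p `pgrep -f salt-minion`
Once you settle on a value, you can push the same command out to the healthy minions with cmd.run.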
Here’s a more interesting problem: say the salt-minion process dies on one of the clients. How do you restart it? You can SSH to the node and restart it manually, or write a script that feeds the output of “salt-run manage.down” to an SSH loop that restarts the minion service on the affected clients. Here’s a simple example that requires passwordless SSH and sudo privileges:
u=your_username
for i in `salt-run manage.down` ; do
  echo "Restarting salt-minion on ${i}"
  ssh -qt -i `grep ^${u}: /etc/passwd | awk -F':' '{print $6}'`/.ssh/id_rsa ${u}@${i} \
    -o StrictHostKeyChecking=no -o PubkeyAuthentication=yes \
    -o PasswordAuthentication=no -o BatchMode=yes \
    "sudo su - root -c '/sbin/service salt-minion restart'"
done
Note: you must have color output disabled in /etc/salt/master if you are going to feed the output of “salt-run” to a shell process.
Here’s another example that shows how to fix a typo in /etc/salt/minion and restart salt-minion service on multiple nodes. In this case I fat-fingered “recon_default: 1000ms” instead of “recon_default: 1000”.
u=your_username
for i in `salt-run manage.down` ; do
  echo "Restarting salt-minion on ${i}"
  ssh -qt -i `grep ^${u}: /etc/passwd | awk -F':' '{print $6}'`/.ssh/id_rsa ${u}@${i} \
    -o StrictHostKeyChecking=no -o PubkeyAuthentication=yes \
    -o PasswordAuthentication=no -o BatchMode=yes \
    "sudo su - root -c 'sed -i s/1000ms/1000/g /etc/salt/minion ; /sbin/service salt-minion restart'"
done
To speed up this process, you may want to use PDSH. Here’s a quick example showing how to restart the salt-minion service on multiple nodes. This is a two-step process: the first step is to generate a list of minions in the format accepted by PDSH, and the second step is to run the PDSH command itself:
salt-run manage.down | while read line ; do echo -n "${line}," >> /tmp/pdsh_nodelist.txt ; done
/usr/bin/pdsh -b -N -t 10 -u 15 -w ^/tmp/pdsh_nodelist.txt "sudo su - root -c '/sbin/service salt-minion restart'"
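If the list of unresponsive minions is short, you can skip the temporary file and build the host list inline; this again assumes color output is disabled on the master:
/usr/bin/pdsh -b -N -t 10 -u 15 -w "`salt-run manage.down 2>/dev/null | paste -s -d, -`" "sudo su - root -c '/sbin/service salt-minion restart'"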
However, the whole point (in my mind, at least) of using Salt is that I don’t have to deal with SSH, keys, passwords, etc. I am thinking a better solution is to have a cron job on every minion that will restart salt-minion service on a regular basis. Something like this should be sufficient, I think:
15 0 * * * /sbin/service salt-minion restart >/dev/null 2>&1
If you have a massive number of minions, you probably want to avoid all of them restarting at precisely the same time. So, when adding the cron job, you may want to randomize the time a little bit to avoid the “thundering herd” issue:
echo "`expr $RANDOM % 60` 0 * * * /sbin/service salt-minion restart >/dev/null 2>&1" | tee -a /var/spool/cron/root
The example above will create a cron job to restart salt-minion service some time between midnight and 1 am. You should pick this restart time window carefully and avoid running other Salt tasks during that time.
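Since the whole point of this exercise is to avoid SSH loops, you can also push that cron entry out with Salt itself. Here is a rough sketch; it assumes /bin/sh on your minions understands $RANDOM (true on most Red Hat-style boxes where /bin/sh is bash) and, as above, that crond notices direct writes to the spool file:
salt "*" -b 50 cmd.run 'echo "$((RANDOM % 60)) 0 * * * /sbin/service salt-minion restart >/dev/null 2>&1" >> /var/spool/cron/root'
Because the arithmetic expansion happens in each minion’s shell, every minion picks its own random minute.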
To minimize the aforementioned “thundering herd” problem in large Salt deployments, you can also apply some of the settings recommended in SaltStack documentation:
salt "*" -b 50 cmd.run "echo 'random_reauth_delay: 120' >> /etc/salt/minion" salt "*" -b 50 cmd.run "echo 'recon_default: 1000' >> /etc/salt/minion" salt "*" -b 50 cmd.run "echo 'recon_max: 59000' >> /etc/salt/minion" salt "*" -b 50 cmd.run "echo 'recon_randomize: True' >> /etc/salt/minion" salt "*" -b 50 cmd.run "/sbin/service salt-minion restart"
This should keep the minions from bothering the master server quite so often.
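If you want to double-check that those settings actually landed, the same batch mechanism works for verification:
salt "*" -b 50 cmd.run "grep -E 'recon_|random_reauth' /etc/salt/minion"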