Centralized Ansible Management With Knockd + Auto-provisioning with AWS

Ansible is a great tool. We’ve been using it at my job with a fair amount of success. When it was chosen, we didn’t have a requirement for supporting Auto Scaling groups in AWS. That requirement poses a unique problem: we need machines to be able to essentially provision themselves when AWS brings them up. This has interesting implications outside of AWS as well. This article covers using the Ansible API to build just enough of a custom playbook runner to target a single machine at a time, discusses how to wire it up to knockd, a “port knocking” server and client, and finally shows how to use user data in AWS to execute this at boot, or at any reboot.

Ansible – A “Push” Model

Ansible is a configuration management tool used in orchestration of large pieces of infrastructure. It’s structured as a simple layer above SSH – but it’s a very sophisticated piece of software. Bottom line, it uses SSH to “push” configuration out to remote servers – this differs from some other popular approaches (like Chef, Puppet and CFEngine) where an agent is run on each machine, and a centralized server manages communication with the agents. Check out How Ansible Works for a bit more detail.

Every approach has its advantages and disadvantages – discussing the nuances is beyond the scope of this article, but the primary disadvantage that Ansible has is also one of its strongest advantages: it’s decentralized and doesn’t require agent installation. The problem arises when you don’t know your inventory (Ansible-speak for “list of all your machines”) beforehand. This can be mitigated with inventory plugins. However, when you have to configure machines that are being spun up dynamically, and that need to be configured quickly, the push model starts to break down.

Luckily, Ansible is highly compatible with automation, and provides a very useful Python API for specialized cases.

Port Knocking For Fun And Profit

Port knocking is a novel way of invoking code. It involves listening to the network at a very low level, and listening for attempted connections to a specific sequence of ports. No ports are opened. It has its roots in network security, where it’s used to temporarily open up firewalls. You knock, then you connect, then you knock again to close the door behind you. It’s very cool tech.
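
From the client's side, a knock is nothing more than a series of connection attempts that are expected to fail. Here is a minimal Python sketch of that idea. The `knock` function and its `delay` and `timeout` parameters are illustrative names of mine, not part of knockd, and the sketch only covers TCP knocks.

```python
import socket
import time


def knock(host, ports, delay=0.1, timeout=0.2):
    """Attempt one TCP connection to each port, in order.

    Returns the list of ports that were knocked on.
    """
    attempted = []
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            # Nothing is expected to be listening, so the result is ignored.
            # connect_ex returns an error code instead of raising.
            sock.connect_ex((host, port))
        finally:
            sock.close()
        attempted.append(port)
        # A short pause helps the knocks arrive in order.
        time.sleep(delay)
    return attempted
```

Calling `knock("172.31.31.48", [9000, 9999])` would send the two-port sequence used in the examples in this article.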

The standard implementation of port knocking is knockd, included with most major Linux distributions. It’s extremely lightweight, and uses a simple configuration file. It supports some interesting features, such as limiting the number of times a client can invoke the knock sequence, by commenting out lines in a flat file.

User Data In EC2

EC2 has a really cool feature called user data, that allows you to add some information to an instance upon boot. It works with cloud-init (installed on most AMIs) to perform tasks and run scripts when the machine is first booted, or rebooted.

Auto Scaling

EC2 provides a mechanism for spinning up instances based on need (or really any arbitrary event). The AWS documentation gives a detailed overview of how it works. It’s useful for responding to sudden spikes in demand, or contracting your running instances during low-demand periods.

Ansible + Knockd = Centralized, On-Demand Configuration

As mentioned earlier, Ansible provides a fairly robust API for use in your own scripts. Knockd can be used to invoke any shell command. Here’s how I tied the two together.

Prerequisites

All of my experimentation was done in EC2, using the Ubuntu 12.04 LTS AMI.

To get the machine running ansible configured, I ran the following commands:

$ sudo apt-get update
$ sudo apt-get install python-dev python-pip knockd
$ sudo pip install ansible

Note: it’s important that you install the python-dev package before you install ansible. This will provide the proper headers so that the C-based SSH library will be compiled, which is faster than the pure-python version installed when the headers are not available.

You’ll notice some information from the knockd package regarding how to enable it. Take note of this for final deployment, but we’ll be running knockd manually during this proof-of-concept exercise.

On the “client” machine, the one who is asking to be configured, you need only install knockd. Again, the service isn’t enabled by default, but the package provides the knock command.

EC2 Setup

We require a few things to be done in the EC2 console for this all to work.

First, I created a keypair for use by the tool. I called it “bootstrap”. I downloaded it onto a freshly set up instance I designated for this purpose.

NOTE: It’s important to set the permissions of the private key correctly. They must be set to 0600.

I then needed to create a special security group. The point of the group is to allow all ports from within the current subnet. This gives us maximum flexibility when assigning port knock sequences.

Here’s what it looks like:

Depending on our circumstances, we might also need to open up UDP traffic (port knocks can be TCP or UDP based, or a combination within a sequence).

For the sake of security, a limited range of a specific type of connection is advised, but since we’re only communicating over our internal subnet, the risk here is minimal.
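
For anyone scripting this rather than clicking through the console, the rule can be expressed as data. The sketch below assumes the boto3 library, and any CIDR or security group ID you pass in is your own. `internal_knock_rules` and `apply_rules` are names I made up for illustration.

```python
def internal_knock_rules(cidr, include_udp=False):
    """Build ingress rules that allow every port from within one subnet."""
    protocols = ["tcp", "udp"] if include_udp else ["tcp"]
    return [
        {
            "IpProtocol": protocol,
            "FromPort": 0,
            "ToPort": 65535,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for protocol in protocols
    ]


def apply_rules(group_id, rules):
    """Attach the rules to an existing security group (needs AWS credentials)."""
    import boto3  # imported here so the rule builder works without boto3

    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=rules
    )
```

Narrowing `FromPort` and `ToPort` to the range your knock sequences actually use is the obvious way to tighten this.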

Note that I’ve also opened SSH traffic to the world. This is not advisable as standard practice, but it’s necessary for me since I do not have a fixed IP address on my connection.

Making It Work

I wrote a simple python script that runs a given playbook against a given IP address:

"""
Script to run a given playbook against a specific host
"""

import ansible.playbook
from ansible import callbacks
from ansible import utils

import argparse
import os, sys

parser = argparse.ArgumentParser(
    description="Run an ansible playbook against a specific host."
)

parser.add_argument(
    'host',
    help="The IP address or hostname of the machine to run the playbook against."
)

parser.add_argument(
    "-p",
    "--playbook",
    default="default.yml",
    metavar="PLAY_BOOK",
    help="Specify path to a specific playbook to run."
)

parser.add_argument(
    "-c",
    "--config_file",
    metavar="CONFIG_FILE",
    default="./config.ini",
    help="Specify path to a config file. Defaults to %(default)s."
)

def run_playbook(host, playbook, user, key_file):
    """
    Run a given playbook against a specific host, with the given username
    and private key file.
    """
    stats = callbacks.AggregateStats()
    playbook_cb = callbacks.PlaybookCallbacks(verbose=utils.VERBOSITY)
    runner_cb = callbacks.PlaybookRunnerCallbacks(stats, verbose=utils.VERBOSITY)

    pb = ansible.playbook.PlayBook(
        host_list=[host,],
        playbook=playbook,
        forks=1,
        remote_user=user,
        private_key_file=key_file,
        runner_callbacks=runner_cb,
        callbacks=playbook_cb,
        stats=stats
    )

    pb.run()

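options = parser.parse_args()

# Playbooks are looked up relative to the "playbooks" directory.
playbook = os.path.abspath("./playbooks/%s" % options.playbook)

# The remote user and key file are hard-coded for this proof of concept;
# options.config_file is parsed but not used yet.
run_playbook(options.host, playbook, 'ubuntu', "./bootstrap.pem")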

Most of the script is user-interface code, using argparse to bring in configuration options. One unimplemented feature is using an INI file to specify things like the default playbook, pem key, user, etc. These things are just hard coded in the call to run_playbook for this proof-of-concept implementation.
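
The unimplemented INI support could look something like the following. This is only a sketch under my own assumptions: the `[ansible]` section name, the key names and the `load_settings` helper are all hypothetical, and it uses the Python 3 `configparser` module (the module is named `ConfigParser` on Python 2).

```python
import configparser

# Fallbacks matching the values hard-coded in the proof of concept.
DEFAULTS = {
    "playbook": "default.yml",
    "user": "ubuntu",
    "key_file": "./bootstrap.pem",
}


def load_settings(path="./config.ini", section="ansible"):
    """Return the settings from an INI file, falling back to DEFAULTS."""
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist.
    parser.read(path)
    settings = dict(DEFAULTS)
    if parser.has_section(section):
        settings.update(parser.items(section))
    return settings
```

The result could then feed the existing call, for example `run_playbook(options.host, playbook, settings["user"], settings["key_file"])`.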

The real heart of the script is the run_playbook function. Given a host (IP or hostname), a path to a playbook file (assumed to be relative to a “playbooks” directory), a user and a private key, it uses the Ansible API to run the playbook.

This function represents the bare-minimum code required to apply a playbook to one or more hosts. It’s surprisingly simple – and I’ve only scratched the surface here of what can be done. With custom callbacks, instead of the ones used by the ansible-playbook runner, we can fine-tune how we collect information about each run.

The playbook I used for testing this implementation is very simple (see the Ansible playbook documentation for an explanation of the playbook syntax):

---
- hosts: all
  sudo: yes
  tasks:
  - name: ensure apache is at the latest version
    apt: update_cache=yes pkg=apache2 state=latest
  - name: drop an arbitrary file just so we know something happened
    copy: src=it_ran.txt dest=/tmp/ mode=0777

It just does an apt-get update, installs Apache at the latest version (the Ubuntu package starts the service when it is installed), and drops a file into /tmp to give me a clue that it ran.

Note that the hosts: setting is set to “all” – this means that this playbook will run regardless of the role or class of the machine. This is essential, since, again, the machines are unknown when they invoke this script.

For the sake of simplicity, and to set a necessary environment variable, I wrapped the call to my script in a shell script:

#!/bin/bash
export ANSIBLE_HOST_KEY_CHECKING=False
cd /home/ubuntu
/usr/bin/python /home/ubuntu/run_playbook.py "$1" >> "$1.log" 2>&1

The $ANSIBLE_HOST_KEY_CHECKING environment variable here is necessary, short of futzing with the ssh configuration for the ubuntu user, to tell Ansible to not bother verifying host keys. This is required in this situation because the machines it talks to are unknown to it, since the script will be used to configure newly launched machines. We’re also running the playbook unattended, so there’s no one to say “yes” to accepting a new key.

The script also does some very rudimentary logging of all output from the playbook run – it creates logs for each host that it services, for easy debugging.
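
If you would rather keep everything in Python, the wrapper can be approximated with the standard library. This is a sketch, not what I actually ran: `run_for_host` is a hypothetical helper, and it uses the current interpreter (`sys.executable`) rather than `/usr/bin/python`.

```python
import os
import subprocess
import sys


def run_for_host(host, script="/home/ubuntu/run_playbook.py",
                 log_dir="/home/ubuntu"):
    """Run the playbook script for one host, appending output to <host>.log."""
    # Copy the environment and disable host key checking for the child only.
    env = dict(os.environ, ANSIBLE_HOST_KEY_CHECKING="False")
    log_path = os.path.join(log_dir, "%s.log" % host)
    with open(log_path, "a") as log:
        # stdout and stderr both go to the per-host log, like ">> $1.log 2>&1".
        return subprocess.call(
            [sys.executable, script, host],
            stdout=log,
            stderr=subprocess.STDOUT,
            env=env,
            cwd=log_dir,
        )
```

The return value is the exit status of the script, which the shell version throws away.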

Finally, the following configuration in knockd.conf makes it all work:

[options]
        UseSyslog

[ansible]
        sequence    = 9000, 9999
        seq_timeout = 5
        Command     = /home/ubuntu/run.sh %IP%

The first configuration section, [options], is special to knockd – it’s used to configure the server. Here we’re just asking knockd to log messages to the system log (e.g. /var/log/messages).

The [ansible] section sets up the knock sequence for a machine that wants Ansible to configure it. The sequence set here (it can be anything – any port number and any number of ports >= 2) is 9000, 9999. There’s a 5 second timeout – in the event that the client doing the knocking takes longer than 5 seconds to complete the sequence, nothing happens.

Finally, the command to run is specified. The special %IP% variable is replaced when the command is executed by the IP address of the machine that knocked.
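
The sequence and timeout rules described above can be modelled in a few lines of Python. This is a simplified sketch of the idea, not knockd's actual implementation: the `KnockTracker` class and its method names are hypothetical, and resetting a client's progress on an out-of-sequence port is an assumption on my part.

```python
class KnockTracker:
    """Track, per client IP, how far through a knock sequence it has got."""

    def __init__(self, sequence, seq_timeout=5):
        self.sequence = list(sequence)
        self.seq_timeout = seq_timeout
        self._progress = {}  # ip -> (index of next expected port, start time)

    def observe(self, ip, port, now):
        """Record one knock; return True when the sequence is complete."""
        index, started = self._progress.get(ip, (0, now))
        timed_out = index > 0 and now - started > self.seq_timeout
        if timed_out or port != self.sequence[index]:
            # Too slow or wrong port: treat this knock as a fresh attempt.
            index, started = 0, now
        if port != self.sequence[index]:
            self._progress.pop(ip, None)
            return False
        index += 1
        if index == len(self.sequence):
            # Sequence complete: this is where the command would run.
            self._progress.pop(ip, None)
            return True
        self._progress[ip] = (index, started)
        return False
```

With `KnockTracker([9000, 9999])`, a client that hits 9000 and then 9999 within five seconds completes the sequence, while one that waits longer has to start over.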

At this point, we can test the setup by running knockd. We can use the -vD options to output lots of useful information.

We just need to then do the knocking from a machine that’s been provisioned with the bootstrap keypair.

Here’s what it looks like (these are all Ubuntu 12.04 LTS instances):

On the “server” machine, the one with the ansible script:

$ sudo knockd -vD
config: new section: 'options'
config: usesyslog
config: new section: 'ansible'
config: ansible: sequence: 9000:tcp,9999:tcp
config: ansible: seq_timeout: 5
config: ansible: start_command: /home/ubuntu/run.sh %IP%
ethernet interface detected
Local IP: 172.31.31.48
listening on eth0...

On the “client” machine, the one that wants to be provisioned:

$ knock 172.31.31.48 9000 9999

Back on the server machine, we’ll see some output upon successful knock:

2014-03-23 10:32:02: tcp: 172.31.24.211:44362 -> 172.31.31.48:9000 74 bytes
172.31.24.211: ansible: Stage 1
2014-03-23 10:32:02: tcp: 172.31.24.211:55882 -> 172.31.31.48:9999 74 bytes
172.31.24.211: ansible: Stage 2
172.31.24.211: ansible: OPEN SESAME
ansible: running command: /home/ubuntu/run.sh 172.31.24.211

Making It Automatic With User Data

Now that we have a way to configure machines on demand – the knock could happen at any time, from a cron job, executed via a distributed SSH client (like fabric), etc – we can use the user data feature of EC2 with cloud-init to do the knock at boot, and every reboot.

Here is the user data that I used, which is technically cloud config code (more examples here):

#cloud-config
packages:
 - knockd

runcmd:
 - knock 172.31.31.48 9000 9999

User data can be edited at any time as long as an EC2 instance is in the “stopped” state. When launching a new instance, the field is hidden in Step 3, under “Advanced Details”:

Once this is established, you can use the “launch more like this” feature of the AWS console to replicate the user data.

This is also a prime use case for writing your own provisioning scripts (using something like boto) or using something a bit higher level, like CloudFormation.
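
As a rough idea of what such a provisioning script might look like, here is a sketch using boto3 (the successor to boto, which is an assumption on my part; I have not used it for this post). `build_user_data` and `launch_instance` are hypothetical helper names, and the AMI ID is something you would supply yourself.

```python
def build_user_data(server_ip, ports):
    """Render the cloud-config document that installs knockd and knocks."""
    sequence = " ".join(str(port) for port in ports)
    return (
        "#cloud-config\n"
        "packages:\n"
        " - knockd\n"
        "\n"
        "runcmd:\n"
        " - knock {ip} {seq}\n"
    ).format(ip=server_ip, seq=sequence)


def launch_instance(image_id, user_data, key_name="bootstrap",
                    instance_type="t1.micro"):
    """Launch one instance carrying the user data (needs AWS credentials)."""
    import boto3  # imported here so build_user_data works without boto3

    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=image_id,
        MinCount=1,
        MaxCount=1,
        InstanceType=instance_type,
        KeyName=key_name,
        UserData=user_data,
    )
```

Generating the user data from the server address and sequence keeps the knock details in one place instead of pasted into every launch.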

Auto Scaling And User Data

Auto Scaling is controlled via “auto scaling groups” and “launch configurations”. If you’re not familiar with them, these can sound like foreign concepts, but they’re quite simple.

Auto Scaling Groups define how many instances will be maintained, and set up the events to scale up or down the number of instances in the group.

Launch Configurations are nearly identical to the basic settings used when launching an EC2 instance, including user data. In fact, user data is entered in Step 3 of the process, in the “Advanced Details” section, just like when spinning up a new EC2 instance.

In this way, we can automatically configure machines that come up via auto scaling.
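
The same launch configuration could be created from a script instead of the console. The sketch below assumes boto3; the helper names, the configuration name and the IDs passed in are placeholders of mine, and `USER_DATA` simply repeats the cloud-config shown earlier in this article.

```python
USER_DATA = """#cloud-config
packages:
 - knockd

runcmd:
 - knock 172.31.31.48 9000 9999
"""


def launch_configuration_params(name, image_id, security_group_ids,
                                user_data=USER_DATA, key_name="bootstrap",
                                instance_type="t1.micro"):
    """Collect the settings for a launch configuration, user data included."""
    return {
        "LaunchConfigurationName": name,
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroups": list(security_group_ids),
        "UserData": user_data,
    }


def create_launch_configuration(params):
    """Send the settings to the Auto Scaling API (needs AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    autoscaling = boto3.client("autoscaling")
    return autoscaling.create_launch_configuration(**params)
```

Every instance the auto scaling group starts from this configuration would then knock on the Ansible machine as it boots.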

Conclusions And Next Steps

This proof of concept presents an exciting opportunity for people who use Ansible and have use cases that benefit from a “pull” model – without really changing anything about their setup.

Here are a few miscellaneous notes, and some things to consider:

  • There are many implementations of port knocking beyond knockd. There is a huge amount of information available to dig into the concept itself, and its various implementations.
  • The way the script is implemented, it’s possible to have different knock sequences execute different playbooks: a “poor-man’s” method of differentiating hosts.
  • The Ansible script could be coupled with the AWS API to get more information about the particular host it’s servicing. Imagine using a tag to set the “class” or “role” of the machine. The API could be used to look up that information about the host, and apply playbooks accordingly. This could also be done with variables – the values that are “punched in” when a playbook is run. This means one source of truth for configuration – just add the relevant bits to the right tags, and it just works.
  • I tested this approach with an auto scaling group, but I’ve used a trivial playbook and only launched 10 machines at a time – it would be a good idea to test this approach with hundreds of machines and more complex plays – my “free tier” t1.micro instance handled this “stampeding herd” without a blink, but it’s unclear how this really scales. If anyone gives this a try, please let me know how it went.
  • Custom callbacks could be used to enhance the script to send notifications when machines were launched, as well as more detailed logging.
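
To make the tag idea above a little more concrete, here is a sketch of how the script might pick a playbook from an instance's tags. It assumes boto3 and a tag named “role”; the helper names and the role-to-playbook mapping are hypothetical.

```python
def tags_to_dict(tag_list):
    """Convert EC2's [{'Key': ..., 'Value': ...}] tag list to a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_list or []}


def playbook_for(tags, role_playbooks, default="default.yml"):
    """Choose a playbook from the instance's 'role' tag, if it has one."""
    return role_playbooks.get(tags.get("role"), default)


def lookup_tags(private_ip):
    """Find the instance with this private IP and return its tags.

    Needs AWS credentials; returns an empty dict when nothing matches.
    """
    import boto3  # imported here so the helpers above work without boto3

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=[{"Name": "private-ip-address", "Values": [private_ip]}]
    )
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            return tags_to_dict(instance.get("Tags"))
    return {}
```

The IP address that knockd passes to the command would be the input to `lookup_tags`, and the chosen playbook would replace the default one.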

13 Responses to Centralized Ansible Management With Knockd + Auto-provisioning with AWS

  1. Hmm…. This looks to be a lot of work to implement provisioning callbacks which are available here – see
    http://www.ansible.com/tower, which would be inexpensive at your scale, and contain a lot of logic about how to do EC2 API updates consistently without hammering the API or missing a host coming online.

    As for port knocking, it seems better to limit access by VPC + credentials.

    (Also be aware of license considerations)

    • jjmojojjmojo says:

      < 100 lines of code and a little config is hardly a lot of work :)

      All this proof of concept is doing is providing a way to run playbooks without requiring a workstation and a human operator. I'm not sure that "provisioning callbacks" are an accurate representation of what this does. The only reason why they're associated with the provisioning process is because of the use of user data to invoke the knock sequence to initiate running a playbook. There's no requirement that this has to happen at any specific time in the life cycle of the instance.

      Using port knocking has the advantage of not requiring credentials – it's as low-impact on the infrastructure as it possibly can be. However, there are other implementations of port knocking that include an exchange of keys, or some other secondary authentication mechanism, so if I had that concern, it could be remedied. But at that point, I might as well use a client-based configuration management tool, like Chef, Puppet or CFEngine.

      I will take a closer look at Tower for sure – but I wonder if it's really buying me anything compared to what I've done here. One of the things I dislike about other centralized tools, like Chef, is that you have to maintain parity between the machines themselves and the centralized orchestration hub (and in the case of Chef, you also have to keep things consistent between your source control and the hub as well) – I'm not sure if Tower fixes this (feel free to enlighten me), but having to maintain multiple sources of truth is tedious. This approach means I can keep all of my metadata in EC2; instead of EC2, my inventory file, and my orchestration tool. I've got another side project that does a similar thing with DNS – I'll be posting about that soon.

      Not sure what you mean by "license considerations" – can you elaborate?

      Thanks for the feedback, I appreciate it!

  2. Joseph Tate says:

    You could actually launch the instances using Ansible. I showed a way to do that at my TriLUG talk two weeks ago. I love being able to fire up instances (even spot instances), wait for them to start, connect, provision, run stuff, and then tear them down as a single “playbook”.

  3. Pingback: THIS WEEK IN AWS, APRIL 29 2014 – This Week In AWS | Amazon Web Services

  4. Why not use git to pull the latest playbook that uses local_action and ec2 module from within user_data? seems a lot less complicated. also to accept a new host key on launch you could do ssh -T -oStrictHostKeyChecking=no user@somehost.com

    • jjmojojjmojo says:

      The playbooks work the same way whether they’re being run automatically as described in the post, or in the typical manner. I also *want* there to be strict host key checking. I don’t want just any machine to be able to invoke the setup – this is the beauty of the Ansible model, using SSH as a transport. In any case, if I used local_action, the playbooks would only work in this situation, and not be reusable for other setups.

      I don’t know where you guys are getting complicated from this – it really couldn’t be simpler.

  5. dgmorales says:

    Interesting post. I’m a puppet user wondering about migrating to ansible, exactly because I want the “~zeroconfig” push model, yet I also wanna keep the pull model for other environments.

    But I would strongly recommend using a more complex knock port sequence. In this example, a simple port scan on your server will likely make it try to configure the machine scanning it, which could even be used to attack you (get information on your setup, and maybe some sensitive information you provision).

    Use several port numbers out of order, mix udp and tcp ports, and you will be safer. Even better, there are some more secure alternatives to port knocking but I’ve never used them and don’t know if they can execute arbitrary commands like this (search for “alternative to port knocking” and “encrypted port knocking”).

    • jjmojojjmojo says:

      A port scan would not detect port knocking – there aren’t any ports in listening state, the concept uses a lower level mechanism. To figure anything out about my setup, you’d have to guess what ports I’m using, and then guess the sequence, by brute force. You’d also have to assume I was using port knocking, since there’s no way to tell from outside (that I’m aware of anyway). This approach has its roots in network security – as I mentioned in my post, it’s used to temporarily, and securely, open ports in firewalls.

      The thing to keep in mind: this is a closed network, it’s AWS – security groups are in place preventing communication from anywhere but an approved subnet – in my example this means anything within my VPC, in reality it could be a very specific subnet. I would hope that no one would consider using this methodology on an internet-facing IP.

      Also consider that the sequence doesn’t have to be set in stone – it could be dynamic, different for different kinds of machines, different depending on the day, or simply decided when the machine is provisioned. But again, this isn’t an issue, the security groups prevent any unauthorized attempts at even trying to figure out the sequence.

      In any case, if you try this out, please let me know how it goes – I haven’t had a chance to do this outside of the “lab” yet, and I’d love to hear how it works for other people.

      Good luck with the transition from Puppet – I think you’ll find Ansible quite comfortable :)

      P.s.
      Say this was open to the world, and someone managed to guess the knock sequence – it’s really OK – all that happens is an attempt at initiating the very earliest stages of an SSH connection – true, some detail about my server will be sent, but from what I understand of the SSH protocol, it probably wouldn’t be useful for much. There is some reverse engineering that can be done with a public key, but I’m not sure it would even get that far – and besides, if we were concerned about that, we could use very very strong encryption. All of this assuming, of course, that the attacker has the specific SSH port I’m using open with something listening on it.

      • dgmorales says:

        Well, indeed having some filtering (be it AWS security groups or whatever), and several tweaks can get you pretty safe. And I got it that you wouldn’t use knocking just like this example in a production internet facing env, but anyway, for the sake of completing the discussion:

        Port knocking has its roots in the security community, but it’s also contested by many in there. With a port sequence as simple as two ascending numbers, the port scan itself performs the knocking: It will try to connect to tcp/9000 and then to tcp/9999 shortly after that. Knockd does not care that other 998 ports were also knocked in between, and all the other ones before and after (ok, I didn’t actually check knockd, but there are other common knocking configs that don’t). And since your knock is not opening some other unknown port, but instead trying to connect back, the scanner does not have to guess anything. It would see the port you are trying to connect to and could discover it’s a SSH login attempt.

        Then I believe the “attacker” could configure a sshd (maybe even some custom version of it) to always accept your connection, no matter what user/password or keys you use. And you are not checking the destination host key, so basically, it could just let you in.

        (Lots of IFs and MAYBEs, I know.)

      • ashleykreger says:

        So dgmorales does raise two very valid points. Somebody could somehow come along, and accidentally hit whatever port knock sequence that “I as some server farmer may set up”… which is kind of security through obscurity, but the odds are still kind of remote since they cannot detect that there is a port knocker there. Additionally, a remote server side sshd install could absolutely be modified to let somebody in without authentication truly taking place since the server just has to say “Yeah, you’re good”. (Enjoy having a heart attack when you happen upon an embedded infrastructure device that does this… (they do exist…))

        Personally, I’m much more a fan of having some sort of pre-shared secret burned into the image that would be booting, and then calling the remote server. Think pop-before-smtp kind of scripting. I think I just dated myself….

        In any event, jjmojojjmojo has put forth a great idea for a low overhead centralized pull, but any deployment should be modified to meet security requirements and controls that are appropriate to the situation. As a professional server farmer turned cloud builder, being too prescriptive about security upfront invites complications and headaches with different security models and requirements, although at the same time people tend not to think about security impacts. Personally I like big large red text with tags warning that people have to actually think about such things and implement appropriately.

        jj, is this on github yet? :)

      • jjmojojjmojo says:

        You’re totally right, but again – it’s in a VPC, and a specific subnet. There are port knocking implementations that utilize shared secrets and encryption keys, if that was necessary, but I’d argue that if it were, I have bigger problems* – remember we’re in the age (and again, it’s AWS) of “servers as cattle”. I would expect this entire infrastructure to be somewhat ephemeral, so the notion of someone coming along after me and accidentally using the system in a way it wasn’t intended doesn’t bother me – if I had this deployed, it would be *the way* that machines are built; it would be documented and well understood by any that followed me – if not, it would be torn down completely :)

        What I’m hearing here, and from dgmorales, is that I need to use a more complex knock sequence in my examples, and discuss the security implications – I’m planning a follow up to do just that, and try to explain the whole workflow in a more consumable way (this is not complicated, I don’t know why it’s being perceived that way). I also want to try to craft my own SSH server and see exactly what sort of information I can glean from a client connection attempt – my understanding of how SSH works makes me think that the risk there is still minimal, but it’s worth exploring to be sure. I’ll also do some port scanning experiments and see how that goes as well.

        Re: github – All of the code required to set this up is contained in that one blob in this post. It could be enhanced, but the bits that do all the work are about 30 lines – one call to the very well crafted Ansible API :) If it evolves much past where it is now, I’ll definitely set up a repo for it!

        * and at that point I might as well just use chef – this whole setup is to avoid running an open service and having to manage agents and certs

      • ashleykreger says:

        Well you know I run puppet at home. :)

        In any event, even though the desire is to treat servers like cattle in this day and age, the risks involved with an alien abducting one of the cattle are far more profound because there is always something that is secret… unless you’re running one of those websites that cannot be named. :)
