Packer and NAT: Netops Hack For The Rest Of Us

I’ve recently been doing a lot of interesting stuff with HashiCorp tools, specifically Packer and Terraform. If you’re not familiar with them, check out the Packer and Terraform project pages. I hope to write a fair bit about these tools, and about some of the other things HashiCorp is up to, in the near future, since there is some very cool stuff going on there!

Before I go any further, two notes:

1) I am not an expert on these toolsets so bear with me :)!

2) I’m doing some things with these tools that are maybe a smidge outside of their intended purpose. Generally speaking, they are “systems” tools, not “networking” tools, but there is obviously tons of bleed-over between “systems” and “networking,” so those lines are a bit blurred. I’m just going to do cool stuff with whatever tools I can get my hands on… so buckle up!

I’m currently using Packer to roll Cisco CSR1000v and some Ubuntu baseline images for some lab scenarios (hopefully more on what the actual labs are in the future too; even cooler stuff there). Packer consumes a .json file (a “template”) that describes the type of image you are building, what assets you are using to create that image, and all sorts of other relevant data. I won’t get too deep into that, as the documentation is generally pretty solid. In my case, I’ve been using VMware vSphere to actually create the images: Packer deploys a VM based on my .json file off to my vCenter, and then exports that image via OVFTool (under the covers, so you don’t even have to know about OVFTool, which is kinda cool all by itself). In a vacuum this works fantastically well. Here is a simple example of how that may look (apologies for bad WordPress formatting):

{
  "builders": [
    {
      "vm_name": "ignw_netops_front_end",
      "type": "vmware-iso",
      "guest_os_type": "other", 
      "format": "vmx",
      "iso_url": "assets/csr1000v-universalk9.03.12.03.S.154-2.S3-std.iso",
      "iso_checksum_type": "md5",
      "iso_checksum": "a92df894bd57af6cd33b715b6cf0ffe8",
      "remote_host": "{{user `esxi_host`}}",
      "remote_datastore": "{{user `esxi_datastore`}}",
      "remote_username": "{{user `esxi_user`}}",
      "remote_password": "{{user `esxi_password`}}",
      "remote_type": "esx5",
      "vnc_disable_password": true,
      "disk_size": "8192",
      "disk_type_id": "zeroedthick",
      "disk_adapter_type": "scsi",
      "vmx_data": {
        "memsize": "8192",
        "numvcpus": "2",
        "ethernet0.virtualDev": "vmxnet3",
        "ethernet0.networkName": "{{user `csr_prtgrp`}}"
      },
      "boot_wait": "4m",
      "keep_registered": true
    }
  ]
}

This shouldn’t look *too* crazy even if you’ve never used Packer. Basically, we’re pointing to the CSR1000v ISO (yes, I know that is an older version!) and to where we want to deploy it (ESX blah), and configuring some basic VM settings (disk, CPU, memory, ethernet0). The double curly braces pull in user variables, which I keep in a separate .json file: a handy way to templatize stuff you’ll use over and over again. So far, so good!
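To make the double-curly-brace syntax a little less magical, here is a toy Python sketch of the substitution idea: every {{user `name`}} reference gets swapped for the value of name from a dictionary of variables. This is only an illustration and not how Packer’s own template engine is implemented. In Packer itself, user variables are defined in the template’s top-level "variables" section, and their values can be passed in from a file with `packer build -var-file=<variables file> <template>`. The function name and the example values below are hypothetical.

```python
import re

# Matches user-variable references such as {{user `esxi_host`}}.
USER_VAR = re.compile(r"\{\{\s*user\s+`([^`]+)`\s*\}\}")


def resolve_user_vars(value, variables):
    """Recursively replace {{user `name`}} references with values from `variables`.

    Toy illustration only: raises KeyError for a reference with no value,
    and leaves numbers, booleans and None untouched.
    """
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            if name not in variables:
                raise KeyError(f"undefined user variable: {name}")
            return str(variables[name])
        return USER_VAR.sub(lookup, value)
    if isinstance(value, dict):
        return {key: resolve_user_vars(item, variables) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_user_vars(item, variables) for item in value]
    return value
```

For example, `resolve_user_vars({"remote_host": "{{user `esxi_host`}}"}, {"esxi_host": "esxi01.lab.local"})` would give back `{"remote_host": "esxi01.lab.local"}`. Failing loudly on a missing variable is a deliberate choice here, so a typo in a variable name shows up immediately instead of leaving a literal placeholder in the build.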

Before the above can work, though, Packer needs one more little hack: enabling the “Guest IP Hack” on the ESX host. Check out Nick Charlton’s blog post on the subject (no affiliation, just a super helpful post!). Here is a super ugly (but functional!) expect script that you can run before your Packer builds to enable it on your ESX host:

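#!/usr/bin/expect
# Usage: expect <this-script> <username> <esxi_host> <password>

set username [lindex $argv 0]
set host [lindex $argv 1]
set password [lindex $argv 2]

# Log in to the ESXi host (host key checking skipped, since this is a lab)
spawn ssh -oStrictHostKeyChecking=no $username@$host
expect "*assword: "
send "$password\r"
sleep 1

# Wait for the ESXi shell prompt, then turn on the Guest IP Hack
expect "*:~] "
send "esxcli system settings advanced set -o /Net/GuestIPHack -i 1\r"

# Wait for the command to finish, then log out cleanly
expect "*:~] "
send "exit\r"
expect eof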
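If you’d rather not pass the ESXi password on the command line, a sketch of the same step in Python could lean on plain `ssh` instead of expect. It assumes key-based SSH login to the ESXi host is already set up (so nothing needs to be typed), and it reads the setting back afterwards with `esxcli system settings advanced list -o /Net/GuestIPHack` so you can confirm the value took. The helper names are hypothetical.

```python
import subprocess

# The esxcli command from the expect script, plus a read-back of the same option.
SET_CMD = "esxcli system settings advanced set -o /Net/GuestIPHack -i 1"
CHECK_CMD = "esxcli system settings advanced list -o /Net/GuestIPHack"


def ssh_argv(user, host, remote_cmd):
    """Build an ssh command line that runs one remote command non-interactively.

    Assumes key-based login is configured on the ESXi host.
    """
    return [
        "ssh",
        "-o", "StrictHostKeyChecking=no",  # same choice as the expect script
        "-o", "BatchMode=yes",             # fail instead of prompting for a password
        f"{user}@{host}",
        remote_cmd,
    ]


def enable_guest_ip_hack(user, host):
    """Set /Net/GuestIPHack to 1, then return esxcli's listing of the option."""
    subprocess.run(ssh_argv(user, host, SET_CMD), check=True)
    result = subprocess.run(
        ssh_argv(user, host, CHECK_CMD),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `print(enable_guest_ip_hack("root", "esxi01.lab.local"))` (host name made up) would run both commands and print the option listing; `check=True` makes a failed login or esxcli error raise instead of passing silently.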

Here is where things get/got interesting. With all that out of the way, Packer can now build your VM and interrogate VMware to get its IP address (allegedly?). Packer will connect to the VM (via the IP address that it learns from VMware) to validate that all is well and to execute any additional commands you’ve asked for (install packages, configure stuff, etc.; none of this is shown here). My CSR has no IP address (OK, it does in some stuff I omitted, but not a *reachable* IP address), so Packer does some interesting stuff in the logs:

2018/04/29 21:44:30 ui: ==> carl_csr: Waiting for SSH to become available...
2018/04/29 21:44:30 packer: 2018/04/29 21:44:30 [DEBUG] Opening new ssh session
2018/04/29 21:44:30 packer: 2018/04/29 21:44:30 [DEBUG] starting remote command: esxcli --formatter csv network vm list
2018/04/29 21:44:30 packer: 2018/04/29 21:44:30 [DEBUG] Opening new ssh session
2018/04/29 21:44:30 packer: 2018/04/29 21:44:30 [DEBUG] starting remote command: esxcli --formatter csv network vm port list -w 200228
2018/04/29 21:44:32 packer: 2018/04/29 21:44:32 [INFO] Attempting SSH connection...
2018/04/29 21:44:32 packer: 2018/04/29 21:44:32 [DEBUG] reconnecting to TCP connection for SSH
2018/04/29 21:44:32 packer: 2018/04/29 21:44:32 [DEBUG] handshaking with SSH
2018/04/29 21:44:33 packer: 2018/04/29 21:44:33 [DEBUG] handshake complete!
2018/04/29 21:44:33 packer: 2018/04/29 21:44:33 [INFO] no local agent socket, will not connect agent
2018/04/29 21:44:33 ui: ==> carl_csr: Connected to SSH!
2018/04/29 21:44:33 packer: 2018/04/29 21:44:33 Running the provision hook
2018/04/29 21:44:33 ui: ==> carl_csr: Forcibly halting virtual machine...

Weird! It seems that Packer just decided that its VNC connection (what it uses to do the initial provisioning, again not shown, just take my word for it I guess) is good enough and it’ll turn down the router and export it. All good… for now….

Here comes the wonky part of what I’m doing (and something that kinda goes against “normal” usage of these tools): I’m rolling these images behind a NAT. And “these” images include things that are *not* a CSR, like regular Ubuntu boxes. Packer has access to the ESX box(es), but no direct access to the VMs once they are brought up. This basically breaks Packer: it can’t connect to the VM, so it can never complete its tasks. I’m not totally clear *why* Packer is perfectly happy to not connect via SSH to the CSR, but it very clearly does want to connect to Ubuntu boxes via SSH, as we can see here in the log output:

2018/04/29 22:01:12 packer: 2018/04/29 22:01:12 Connection refused when connecting to: X.X.X.X
2018/04/29 22:01:12 packer: 2018/04/29 22:01:12 [DEBUG] Error getting SSH address: No interface on the VM has an IP address ready
2018/04/29 22:01:14 ui error: ==> carl_jenkins: Timeout waiting for SSH.
2018/04/29 22:01:14 packer: 2018/04/29 22:01:14 [DEBUG] SSH wait cancelled. Exiting loop.
2018/04/29 22:01:14 ui: ==> carl_jenkins: Step "StepConnect" failed
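The “No interface on the VM has an IP address ready” error ties back to the two esxcli commands in the CSR log earlier: Packer first runs `esxcli --formatter csv network vm list`, then `esxcli --formatter csv network vm port list -w <world ID>` for the VM it found. The sketch below imitates that lookup on captured CSV text so you can see what “no IP ready” means. The column names Name, WorldID and IPAddress are my assumption about esxcli’s CSV output, and the function names are hypothetical. Behind a NAT, even a successful lookup only yields an inside address that the build host can’t route to.

```python
import csv
import io


def find_world_id(vm_list_csv, vm_name):
    """Return the WorldID for vm_name from `esxcli --formatter csv network vm list`
    output, or None if the VM is not listed (column names assumed)."""
    for row in csv.DictReader(io.StringIO(vm_list_csv)):
        if row.get("Name") == vm_name:
            return row.get("WorldID")
    return None


def first_guest_ip(port_list_csv):
    """Return the first usable IP from `esxcli --formatter csv network vm port list -w <id>`
    output, or None.

    An empty or 0.0.0.0 address means VMware has not learned the guest's IP yet,
    which is the situation behind "No interface on the VM has an IP address ready".
    """
    for row in csv.DictReader(io.StringIO(port_list_csv)):
        ip = (row.get("IPAddress") or "").strip()
        if ip and ip != "0.0.0.0":
            return ip
    return None
```

With made-up input such as `"Name,NumPorts,WorldID,\ncarl_jenkins,1,200228,\n"`, `find_world_id(..., "carl_jenkins")` returns `"200228"`; a port list whose IPAddress column reads `0.0.0.0` makes `first_guest_ip` return None. The trailing comma in the sample mimics esxcli-style CSV and just produces an extra empty column that `DictReader` ignores here.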

A bit of searching around seemed to indicate that you can basically provide SSH data in your .json file so that Packer knows credentials and IPs and all that fun stuff. So I added the following:

 "ssh_username": "carl",
 "ssh_password": "carl",
 "ssh_wait_timeout": "1m",
 "ssh_port": 1234,
 "ssh_host": "10.10.10.10",
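If you juggle several templates (a CSR one and an Ubuntu one here), a small script can merge these settings into each builder instead of hand-editing every file. This is a hypothetical helper, not part of Packer; it copies the template, adds the keys to every builder of a given type, and writes the result to a new file so the original stays untouched. The values are the post’s lab placeholders.

```python
import copy
import json

# The NAT-facing SSH settings from the snippet above (lab placeholder values).
NAT_SSH_SETTINGS = {
    "ssh_username": "carl",
    "ssh_password": "carl",
    "ssh_wait_timeout": "1m",
    "ssh_port": 1234,
    "ssh_host": "10.10.10.10",
}


def apply_to_builders(template, settings, builder_type="vmware-iso"):
    """Return a copy of a Packer JSON template with `settings` merged into every
    builder of `builder_type`; the input template is not modified."""
    patched = copy.deepcopy(template)
    for builder in patched.get("builders", []):
        if builder.get("type") == builder_type:
            builder.update(settings)
    return patched


def patch_template_file(src, dst, settings, builder_type="vmware-iso"):
    """Read the template at `src`, merge in `settings`, and write it to `dst`."""
    with open(src) as f:
        template = json.load(f)
    with open(dst, "w") as f:
        json.dump(apply_to_builders(template, settings, builder_type), f, indent=2)
```

Keeping the overrides in one dictionary means any further builder keys you end up needing can be added in one place and applied to every template the same way.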

Running the Packer build again, it looks like nothing has changed…

2018/04/29 22:01:12 packer: 2018/04/29 22:01:12 Connection refused when connecting to: X.X.X.X
2018/04/29 22:01:12 packer: 2018/04/29 22:01:12 [DEBUG] Error getting SSH address: No interface on the VM has an IP address ready
2018/04/29 22:01:14 ui error: ==> carl_jenkins: Timeout waiting for SSH.
2018/04/29 22:01:14 packer: 2018/04/29 22:01:14 [DEBUG] SSH wait cancelled. Exiting loop.
2018/04/29 22:01:14 ui: ==> carl_jenkins: Step "StepConnect" failed

Weirdly, it doesn’t even seem to try to connect to the IP that I provided, just the IP that it gleaned from ESX. The “fix” is super obvious in retrospect, but the documentation does not make it immediately clear (at least to me; maybe I’m slow?! It is Sunday afternoon and I do have a beer handy after all!): add the “communicator” argument to the .json file:

"communicator": "ssh",

This kinda blew my mind, since it was REALLY obviously already using SSH… and I had already passed it the right info… but whatever, it’s working now in my silly non-standard setup. And the log shows what we’d hope to see:

2018/04/29 22:13:12 packer: 2018/04/29 22:13:12 [DEBUG] TCP connection to SSH ip/port failed: dial tcp 10.10.10.10:1234: connect: connection refused
2018/04/29 22:13:17 packer: 2018/04/29 22:13:17 [INFO] Attempting SSH connection...
2018/04/29 22:13:17 packer: 2018/04/29 22:13:17 [DEBUG] reconnecting to TCP connection for SSH
2018/04/29 22:13:17 packer: 2018/04/29 22:13:17 [DEBUG] handshaking with SSH
2018/04/29 22:13:17 packer: 2018/04/29 22:13:17 [DEBUG] handshake complete!
2018/04/29 22:13:17 packer: 2018/04/29 22:13:17 [INFO] no local agent socket, will not connect agent
2018/04/29 22:13:17 ui: ==> carl_jenkins: Connected to SSH!
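The log above is essentially a polling loop: the first TCP attempt to 10.10.10.10:1234 is refused, then about five seconds later Packer tries again and gets through. If you want to check the NAT port forward from the build host before kicking off a long build, a sketch like this does the same kind of polling. The function name is hypothetical, the 5-second default only echoes the gap in this log (not a documented Packer value), and it checks TCP reachability only, not the SSH handshake or credentials.

```python
import socket
import time


def wait_for_ssh_port(host, port, timeout=60.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Returns True as soon as something accepts the connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # e.g. "connection refused" while the VM is still booting,
            # or no route because the NAT / port forward isn't in place yet
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval, remaining))
```

For this setup the call would be `wait_for_ssh_port("10.10.10.10", 1234, timeout=60)`; a False result after the timeout points at the NAT or port forward rather than at Packer.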

Hopefully this will save somebody some headache later!