What To Do About A Broken BBL BOSH

What is bbl?

bbl - the bosh bootloader - is a convenient utility for bootstrapping Cloud Foundry BOSH on any major IaaS. The resulting two-VM deployment (a jumpbox used as a bastion host, plus the BOSH director itself) can then be used to deploy and manage BOSH releases, most notably Concourse.
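
For context, a typical bbl workflow looks roughly like this. It's a minimal sketch for AWS; the subcommands and BBL_* environment variables follow the bbl documentation, so double-check them against the version you run:

$ export BBL_IAAS=aws
$ export BBL_AWS_ACCESS_KEY_ID=<redacted>
$ export BBL_AWS_SECRET_ACCESS_KEY=<redacted>
$ export BBL_AWS_REGION=us-east-1
$ bbl plan                   # writes the terraform and bosh plans into the current directory
$ bbl up                     # terraforms the IaaS, then creates the jumpbox and the director
$ eval "$(bbl print-env)"    # exports BOSH_ENVIRONMENT, BOSH_CA_CERT, BOSH_CLIENT, ...
$ bosh env                   # the bosh CLI now talks to the new director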

Default behavior: the problem

When bbl terraforms the infrastructure for BOSH, it generates a CA and signs a director certificate that is valid for 365 days. Once that year is up, the user can no longer communicate with the BOSH director, and BOSH itself can no longer talk to the VMs it manages, so it keeps recreating them in a vicious cycle.

When your bosh certificate expires, you're likely to see the following error message:

$ bosh vms
Fetching info:
Performing request GET 'https://10.0.0.6:25555/info':
Performing GET request:
Retry: Get https://10.0.0.6:25555/info: x509: certificate has expired or is not yet valid
Exit code 1
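
You can confirm the diagnosis without going through the director at all: the certificate bbl generated is kept in the vars store inside your bbl state directory. Below is a sketch, assuming the default bbl layout and the director_ssl variable name used by bosh-deployment:

$ cd /path/to/your/bbl-state
$ bosh interpolate vars/director-vars-store.yml --path /director_ssl/certificate | openssl x509 -noout -subject -dates

The notAfter line tells you exactly when the director certificate expired (or will expire).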

bbl comes with a rotate subcommand, but it's misleading: it only rotates the jumpbox's SSH key (recreating the jumpbox VM in the process) and never touches the director or its certificates - note the 'No deployment, stemcell or release changes. Skipping deploy.' line below. It leaves the user with a false sense of security.

$ bbl rotate
step: terraform init
step: terraform apply
step: creating jumpbox
Deployment manifest: '/tmp/bbl/jumpbox-deployment/jumpbox.yml'
Deployment state: '/tmp/bbl/vars/jumpbox-state.json'
Started validating
Downloading release 'os-conf'... Skipped [Found in local cache] (00:00:00)
Validating release 'os-conf'... Finished (00:00:00)
Downloading release 'bosh-aws-cpi'... Skipped [Found in local cache] (00:00:00)
Validating release 'bosh-aws-cpi'... Finished (00:00:00)
Validating cpi release... Finished (00:00:00)
Validating deployment manifest... Finished (00:00:00)
Downloading stemcell... Skipped [Found in local cache] (00:00:00)
Validating stemcell... Finished (00:00:00)
Finished validating (00:00:00)
Started installing CPI
Compiling package 'ruby-2.4-r4/0cdc60ed7fdb326e605479e9275346200af30a25'... Finished (00:03:48)
Compiling package 'bosh_aws_cpi/2efde942607c6ec5b12673499b2bb32e424fffdcbf3e02589578de38c658dc76'... Finished (00:00:03)
Installing packages... Finished (00:00:00)
Rendering job templates... Finished (00:00:01)
Installing job 'aws_cpi'... Finished (00:00:00)
Finished installing CPI (00:03:54)
Uploading stemcell 'bosh-aws-xen-hvm-ubuntu-xenial-go_agent/456.40'... Skipped [Stemcell already uploaded] (00:00:00)
Started deploying
Waiting for the agent on VM 'i-05f52f801e269d8df'... Failed (00:00:10)
Deleting VM 'i-05f52f801e269d8df'... Finished (00:00:36)
Creating VM for instance 'jumpbox/0' from stemcell 'ami-030ad6ac815c68380 light'... Finished (00:01:03)
Waiting for the agent on VM 'i-001e559d5eb08e0b3' to be ready... Finished (00:00:26)
Rendering job templates... Finished (00:00:01)
Updating instance 'jumpbox/0'... Finished (00:00:04)
Waiting for instance 'jumpbox/0' to be running... Finished (00:00:00)
Running the post-start scripts 'jumpbox/0'... Finished (00:00:00)
Finished deploying (00:02:30)
Cleaning up rendered CPI jobs... Finished (00:00:00)
Succeeded
step: created jumpbox
step: creating bosh director
Deployment manifest: '/tmp/bbl/bosh-deployment/bosh.yml'
Deployment state: '/tmp/bbl/vars/bosh-state.json'
Started validating
Downloading release 'bosh'... Skipped [Found in local cache] (00:00:00)
Validating release 'bosh'... Finished (00:00:00)
Downloading release 'bpm'... Skipped [Found in local cache] (00:00:00)
Validating release 'bpm'... Finished (00:00:00)
Downloading release 'bosh-aws-cpi'... Skipped [Found in local cache] (00:00:00)
Validating release 'bosh-aws-cpi'... Finished (00:00:00)
Downloading release 'os-conf'... Skipped [Found in local cache] (00:00:00)
Validating release 'os-conf'... Finished (00:00:00)
Downloading release 'uaa'... Skipped [Found in local cache] (00:00:00)
Validating release 'uaa'... Finished (00:00:01)
Downloading release 'credhub'... Skipped [Found in local cache] (00:00:00)
Validating release 'credhub'... Finished (00:00:00)
Validating cpi release... Finished (00:00:00)
Validating deployment manifest... Finished (00:00:00)
Downloading stemcell... Skipped [Found in local cache] (00:00:00)
Validating stemcell... Finished (00:00:00)
Finished validating (00:00:04)
No deployment, stemcell or release changes. Skipping deploy.
Succeeded
step: created bosh director
step: generating cloud config
step: applying cloud config
step: applying runtime config

Impact

With the default settings, you will face downtime. BOSH's resurrector will keep killing your VMs, thinking there is something wrong with the agent. In the worst case, you might be forced to repave your entire infrastructure, as one user commented on GitHub:

"We recently failed to rotate the CA cert for our bbl-deployed director. We failed, then had to tear down all of our infrastructure and rebuild it. Would be great if bbl natively rotated CA certs.... because I don't think destroying all current infrastructure is an option for many teams."

Prevention

Use the following command to check how much longer the BOSH director's certificate is valid:

$ bbl ssh --director --cmd "curl -vvvk https://10.0.0.6:25555/info"

The command will produce output similar to the one below:

checking host key
running:
ssh -T jumpbox@23.21.16.181 -i /var/folders/60/0309s59d0dzb6j72b2tzdy0m0000gn/T/310904643/jumpbox-private-key echo host key confirmed
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
host key confirmed
opening a tunnel through your jumpbox
starting:
ssh -4 -D 57083 -nNC jumpbox@23.21.16.181 -i /var/folders/60/0309s59d0dzb6j72b2tzdy0m0000gn/T/310904643/jumpbox-private-key
executing command on director:
curl -vvvk https://10.0.0.6:25555/info
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
running:
ssh -tt -o StrictHostKeyChecking=no -o ServerAliveInterval=300 -o ProxyCommand=nc -x localhost:57083 %h %p -i /var/folders/60/0309s59d0dzb6j72b2tzdy0m0000gn/T/310904643/director-private-key jumpbox@10.0.0.6 curl -vvvk https://10.0.0.6:25555/info
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
* Trying 10.0.0.6...
* Connected to 10.0.0.6 (10.0.0.6) port 25555 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 740 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* common name: 10.0.0.6 (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: C=USA,O=Cloud Foundry,CN=10.0.0.6
* start date: Wed, 22 Jul 2020 17:44:01 GMT
* expire date: Thu, 22 Jul 2021 17:44:01 GMT
* issuer: C=USA,O=Cloud Foundry,CN=ca
* compression: NULL
* ALPN, server accepted to use http/1.1
> GET /info HTTP/1.1
> Host: 10.0.0.6:25555
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Fri, 24 Jul 2020 04:59:59 GMT
< Content-Type: application/json
< Content-Length: 605
< Connection: keep-alive
< X-Content-Type-Options: nosniff
<
* Connection #0 to host 10.0.0.6 left intact
{"name":"bosh-bbl-env-qinghai-2020-07-22t17-34z","uuid":"0752df76-6f59-4865-924d-451e0ae8b959","version":"270.7.0 (00000000)","user":null,"cpi":"aws_cpi","stemcell_os":"ubuntu-xenial","stemcell_version":"456.40","user_authentication":{"type":"uaa","options":{"url":"https://10.0.0.6:8443","urls":["https://10.0.0.6:8443"]}},"features":{"local_dns":{"status":true},"power_dns":{"status":false,"extras":{"domain_name":"bosh"}},"compiled_package_cache":{"status":false,"extras":{"provider":null}},"snapshots":{"status":false},"config_server":{"status":true,"extras":{"urls":["https://10.0.0.6:8844/api/"]}}}}
Connection to 10.0.0.6 closed.
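
If you'd rather not eyeball the verbose curl output, the same check can be narrowed down to the validity date alone. This is a sketch; it assumes the director's internal IP is 10.0.0.6, as above, and that the openssl CLI is available on the director VM:

$ bbl ssh --director --cmd "echo | openssl s_client -connect 10.0.0.6:25555 2>/dev/null | openssl x509 -noout -enddate"

A notAfter date in the past, or uncomfortably close to it, means it's time to act. Running this check on a schedule (from a Concourse pipeline, say) turns a surprise outage into routine maintenance.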

It's a good idea to rotate the BOSH director's certificates well before they expire - every three months, say. @poligraph has kindly suggested a workaround in this GitHub issue.
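
The gist of such workarounds is to force the expiring credentials to be regenerated. The sketch below is not a verified procedure - the variable names come from bosh-deployment and may differ in your version, the director VM will be recreated, and deployments managed by the director may need to be recreated afterwards - so back up your state directory and rehearse it in a sandbox first:

$ cd /path/to/your/bbl-state
$ cp vars/director-vars-store.yml vars/director-vars-store.yml.bak
# edit vars/director-vars-store.yml and delete the expired entries,
# e.g. director_ssl, mbus_bootstrap_ssl and nats_server_tls
$ bbl up    # re-runs bosh create-env, which regenerates the deleted variables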

Were you affected by the bosh certificate expiry fiasco?

If you are affected by this problem, please reach out - we'll be happy to salvage your BOSH deployments even after the year has elapsed, and we'll set up your infrastructure so that similar fiascos don't happen going forward.