What To Do About A Broken BBL Bosh
What is bbl?
bbl (the BOSH Bootloader) is a convenient utility for bootstrapping Cloud Foundry BOSH on any major IaaS. The resulting two-VM deployment (a jumpbox used as a bastion host, plus the BOSH director itself) can be used to deploy and manage BOSH releases, most notably Concourse.
Default behavior: the problem
When bbl terraforms the infrastructure for BOSH, it generates a CA and signs a certificate that's valid for 365 days. Once that year has passed, the user can no longer communicate with the BOSH director, and BOSH itself cannot talk to the VMs it manages, so it keeps rebooting them in a vicious cycle.
When your bosh certificate expires, you're likely to see the following error message:
$ bosh vms
Fetching info:
Performing request GET 'https://10.0.0.6:25555/info':
Performing GET request:
Retry: Get https://10.0.0.6:25555/info: x509: certificate has expired or is not yet valid
Exit code 1
bbl comes with a rotate subcommand, but it's misleading: it only rotates the jumpbox and leaves the user with a false sense of security.
$ bbl rotate
step: terraform init
step: terraform apply
step: creating jumpbox
Deployment manifest: '/tmp/bbl/jumpbox-deployment/jumpbox.yml'
Deployment state: '/tmp/bbl/vars/jumpbox-state.json'
Started validating
Downloading release 'os-conf'... Skipped [Found in local cache] (00:00:00)
Validating release 'os-conf'... Finished (00:00:00)
Downloading release 'bosh-aws-cpi'... Skipped [Found in local cache] (00:00:00)
Validating release 'bosh-aws-cpi'... Finished (00:00:00)
Validating cpi release... Finished (00:00:00)
Validating deployment manifest... Finished (00:00:00)
Downloading stemcell... Skipped [Found in local cache] (00:00:00)
Validating stemcell... Finished (00:00:00)
Finished validating (00:00:00)
Started installing CPI
Compiling package 'ruby-2.4-r4/0cdc60ed7fdb326e605479e9275346200af30a25'... Finished (00:03:48)
Compiling package 'bosh_aws_cpi/2efde942607c6ec5b12673499b2bb32e424fffdcbf3e02589578de38c658dc76'... Finished (00:00:03)
Installing packages... Finished (00:00:00)
Rendering job templates... Finished (00:00:01)
Installing job 'aws_cpi'... Finished (00:00:00)
Finished installing CPI (00:03:54)
Uploading stemcell 'bosh-aws-xen-hvm-ubuntu-xenial-go_agent/456.40'... Skipped [Stemcell already uploaded] (00:00:00)
Started deploying
Waiting for the agent on VM 'i-05f52f801e269d8df'... Failed (00:00:10)
Deleting VM 'i-05f52f801e269d8df'... Finished (00:00:36)
Creating VM for instance 'jumpbox/0' from stemcell 'ami-030ad6ac815c68380 light'... Finished (00:01:03)
Waiting for the agent on VM 'i-001e559d5eb08e0b3' to be ready... Finished (00:00:26)
Rendering job templates... Finished (00:00:01)
Updating instance 'jumpbox/0'... Finished (00:00:04)
Waiting for instance 'jumpbox/0' to be running... Finished (00:00:00)
Running the post-start scripts 'jumpbox/0'... Finished (00:00:00)
Finished deploying (00:02:30)
Cleaning up rendered CPI jobs... Finished (00:00:00)
Succeeded
step: created jumpbox
step: creating bosh director
Deployment manifest: '/tmp/bbl/bosh-deployment/bosh.yml'
Deployment state: '/tmp/bbl/vars/bosh-state.json'
Started validating
Downloading release 'bosh'... Skipped [Found in local cache] (00:00:00)
Validating release 'bosh'... Finished (00:00:00)
Downloading release 'bpm'... Skipped [Found in local cache] (00:00:00)
Validating release 'bpm'... Finished (00:00:00)
Downloading release 'bosh-aws-cpi'... Skipped [Found in local cache] (00:00:00)
Validating release 'bosh-aws-cpi'... Finished (00:00:00)
Downloading release 'os-conf'... Skipped [Found in local cache] (00:00:00)
Validating release 'os-conf'... Finished (00:00:00)
Downloading release 'uaa'... Skipped [Found in local cache] (00:00:00)
Validating release 'uaa'... Finished (00:00:01)
Downloading release 'credhub'... Skipped [Found in local cache] (00:00:00)
Validating release 'credhub'... Finished (00:00:00)
Validating cpi release... Finished (00:00:00)
Validating deployment manifest... Finished (00:00:00)
Downloading stemcell... Skipped [Found in local cache] (00:00:00)
Validating stemcell... Finished (00:00:00)
Finished validating (00:00:04)
No deployment, stemcell or release changes. Skipping deploy.
Succeeded
step: created bosh director
step: generating cloud config
step: applying cloud config
step: applying runtime config
Impact
With default settings, you will face some downtime. BOSH's resurrector will keep killing your VMs, thinking there is something wrong with the agent. In the worst-case scenario, you might be forced to repave your entire infrastructure, as one user commented on GitHub.
Prevention
Use the following command to verify the freshness of the BOSH certificate:
$ bbl ssh --director --cmd "curl -vvvk https://10.0.0.6:25555/info"
The command produces output similar to the following. The "start date" and "expire date" lines show the certificate's validity window:
checking host key
running:
ssh -T jumpbox@23.21.16.181 -i /var/folders/60/0309s59d0dzb6j72b2tzdy0m0000gn/T/310904643/jumpbox-private-key echo host key confirmed
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
host key confirmed
opening a tunnel through your jumpbox
starting:
ssh -4 -D 57083 -nNC jumpbox@23.21.16.181 -i /var/folders/60/0309s59d0dzb6j72b2tzdy0m0000gn/T/310904643/jumpbox-private-key
executing command on director:
curl -vvvk https://10.0.0.6:25555/info
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
running:
ssh -tt -o StrictHostKeyChecking=no -o ServerAliveInterval=300 -o ProxyCommand=nc -x localhost:57083 %h %p -i /var/folders/60/0309s59d0dzb6j72b2tzdy0m0000gn/T/310904643/director-private-key jumpbox@10.0.0.6 curl -vvvk https://10.0.0.6:25555/info
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
* Trying 10.0.0.6...
* Connected to 10.0.0.6 (10.0.0.6) port 25555 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 740 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* common name: 10.0.0.6 (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: C=USA,O=Cloud Foundry,CN=10.0.0.6
* start date: Wed, 22 Jul 2020 17:44:01 GMT
* expire date: Thu, 22 Jul 2021 17:44:01 GMT
* issuer: C=USA,O=Cloud Foundry,CN=ca
* compression: NULL
* ALPN, server accepted to use http/1.1
> GET /info HTTP/1.1
> Host: 10.0.0.6:25555
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Fri, 24 Jul 2020 04:59:59 GMT
< Content-Type: application/json
< Content-Length: 605
< Connection: keep-alive
< X-Content-Type-Options: nosniff
<
* Connection #0 to host 10.0.0.6 left intact
{"name":"bosh-bbl-env-qinghai-2020-07-22t17-34z","uuid":"0752df76-6f59-4865-924d-451e0ae8b959","version":"270.7.0 (00000000)","user":null,"cpi":"aws_cpi","stemcell_os":"ubuntu-xenial","stemcell_version":"456.40","user_authentication":{"type":"uaa","options":{"url":"https://10.0.0.6:8443","urls":["https://10.0.0.6:8443"]}},"features":{"local_dns":{"status":true},"power_dns":{"status":false,"extras":{"domain_name":"bosh"}},"compiled_package_cache":{"status":false,"extras":{"provider":null}},"snapshots":{"status":false},"config_server":{"status":true,"extras":{"urls":["https://10.0.0.6:8844/api/"]}}}}
Connection to 10.0.0.6 closed.
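Rather than reading the expiry date out of the verbose output by eye, the check can be scripted. The following is a minimal sketch and not part of bbl: the function names expiry_from_curl and days_until are hypothetical, it assumes GNU date (for the -d option), and it assumes curl prints an "expire date:" line as in the sample output, which depends on the TLS library curl was built with.

```shell
#!/usr/bin/env bash

# expiry_from_curl (hypothetical helper): reads `curl -v` output on stdin and
# prints the value of the first "expire date:" line, without any trailing
# carriage return left behind by `ssh -tt`.
expiry_from_curl() {
  sed -n 's/^.*expire date: \(.*GMT\).*$/\1/p' | head -n 1
}

# days_until (hypothetical helper): $1 is the expiry date string, $2 is an
# optional reference date (defaults to now). Prints the number of whole days
# remaining, or "expired" if the expiry date is already in the past.
# Assumes GNU date, which accepts dates such as "Thu, 22 Jul 2021 17:44:01 GMT".
days_until() {
  local expiry_s now_s diff
  expiry_s=$(date -u -d "$1" +%s) || return 1
  now_s=$(date -u -d "${2:-now}" +%s) || return 1
  diff=$(( expiry_s - now_s ))
  if [ "$diff" -lt 0 ]; then
    echo "expired"
  else
    echo $(( diff / 86400 ))
  fi
}

# Possible usage (assumes the verbose curl output reaches stdout):
#   expiry=$(bbl ssh --director --cmd "curl -vvvk https://10.0.0.6:25555/info" 2>&1 | expiry_from_curl)
#   days_until "$expiry"
```

With the dates from the sample output (checked on 24 Jul 2020, expiring on 22 Jul 2021), days_until reports 363 whole days remaining.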
It's a good idea to rotate the BOSH director's certificates every three months. @poligraph has kindly suggested a workaround in this GitHub issue.
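A periodic check can serve as a reminder that rotation is due. The sketch below is an illustration, not part of bbl or BOSH: it assumes you have saved a copy of the director's certificate to a PEM file (director.pem is a hypothetical path), and it treats "three months" as 90 days. The function name cert_expires_within is also hypothetical. It relies on the -checkend option of openssl x509.

```shell
#!/usr/bin/env bash

# cert_expires_within PEM_FILE DAYS (hypothetical helper)
# Returns 0 (true) if the certificate in PEM_FILE expires within DAYS days,
# or if the file cannot be read; returns 1 otherwise.
cert_expires_within() {
  local pem=$1 days=$2
  # `openssl x509 -checkend N` exits 0 when the certificate will still be
  # valid N seconds from now, and non-zero otherwise.
  if openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$pem" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Possible usage, for example from a scheduled job:
#   if cert_expires_within director.pem 90; then
#     echo "BOSH director certificate expires within 90 days; time to rotate"
#   fi
```

Treating an unreadable file as "expiring" is a deliberate choice here, so that a broken check raises a warning instead of passing silently.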
Were you affected by the bosh certificate expiry fiasco?
If you are affected by this problem, please reach out. We'll be happy to salvage your BOSH deployments even after the year has elapsed, and we'll set up your infrastructure to prevent similar fiascos going forward.