The wild world of certificates

Naor Matania
6 min read · Oct 12, 2021

Most people don’t fully understand the mechanics of certificates. Most just get the basics and can live a full life of engineering without learning further. But sometimes shit happens and you need to dig deeper.

Recently my team started getting reports that our website failed to load due to an expired certificate. At first glance, nothing seemed wrong: our certificate was not due to expire for another 2 months (!!). The problem also wasn't consistent. We couldn't reproduce it in our office, while teammates who were working abroad at the time hit it constantly. This suggested a caching issue. But what exactly was being cached, and how could we fix it?

It turns out that the same issue we faced had also impacted big tech companies like Slack and Shopify, taking down their websites (check out this news report). So don’t fool yourself that this is only some super rare and arbitrary issue that can never happen to you ;)

Before we dig into the problem and how we solved it, let’s go over a quick (and very partial) intro to certificates and how we issued ours in the first place. Hang tight!

Certificates 101

Websites can serve traffic over HTTPS, which wraps HTTP in TLS (the successor to the now-deprecated SSL). TLS requires a certificate that lets the client device verify the server’s identity.

A certificate contains a public key, the server’s identity (a DNS name in most cases) and a signature over those contents, which can be verified with the public key of the issuer (called a CA, for certificate authority). The certificate is signed either directly by a root CA or through a chain of CAs that ends at a root CA. Root CAs are trusted by devices around the world, and their certificates are shipped with our devices.

So how does it make our website more secure?

If another entity pretends to serve google.com, it won’t be able to present a valid certificate for google.com, because no trusted CA will sign one for it. This assures us that we are communicating with the server we intended to reach and not with an attacker. Certificates also expire and need to be renewed, which limits how long a private key (the one matching the public key in the certificate) stays useful to an attacker who somehow manages to compromise it.
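To make the client side of this concrete, here is a minimal Python sketch using only the standard `ssl` and `socket` modules. The function names are mine, chosen for illustration; this is not the code we run in production. It shows that a default client context both validates the chain and checks the hostname, and how the expiry date can be read from a server’s certificate.

```python
import socket
import ssl


def make_verifying_context() -> ssl.SSLContext:
    """Return a client context that validates the chain and the hostname."""
    # create_default_context() loads the system's trusted root CAs and
    # turns on both chain validation and hostname checking.
    return ssl.create_default_context()


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host and return the 'notAfter' field of its certificate."""
    context = make_verifying_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname is the name the certificate is checked against.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `fetch_not_after("example.com")` needs network access. If the client can’t build a valid chain, or the certificate doesn’t cover the hostname, the handshake raises `ssl.SSLCertVerificationError` instead of returning a date.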

I don’t know about you but I feel safer surfing on the web after learning about the powerful security of TLS :)

SSL Certificate details

How we issued our certificate

Our product is built on GCP (Google Cloud Platform). GCP, like other major cloud platforms, has a product that issues managed certificates for your website domain. Using a managed certificate is great because the cloud provider takes care of issuing a new certificate for you when the current one is about to expire or if there’s a risk that it got compromised.

Unfortunately, we couldn’t use a managed certificate since we wanted our certificate to cover a wildcard domain name, which GCP’s managed certificates didn’t support. Our product is a B2B product that serves multiple customers, and we designed it in such a way that each customer gets their own subdomain. The easiest way for your website to serve traffic for customer_a.mysite.com, customer_b.mysite.com etc. is to have a single certificate issued for *.mysite.com.

After looking at our options we decided to use Let’s Encrypt to issue our certificate. Let’s Encrypt is a free CA that issues certificates through the ACME protocol. As part of the protocol, you have to prove that you control the domain you are requesting a certificate for.

After running the protocol with Let’s Encrypt, we got a certificate and a private key (matching the public key in the certificate), which we promptly handed to our HTTPS load balancer so it could terminate HTTPS traffic from client devices. The certificate had an expiration time, so our plan was to run the ACME protocol again shortly before it expired and get a new certificate with a later expiration date.
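As a rough illustration of what a wildcard name covers, here is a simplified Python sketch of the matching rule. It assumes the wildcard is the whole left-most label and stands for exactly one label. Real TLS clients apply more detailed rules, and `wildcard_matches` is a name I made up for this example.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Check a hostname against a certificate name such as '*.mysite.com'."""
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    # The wildcard stands for exactly one label, so label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # Compare everything to the right of the wildcard label.
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels
```

Under this rule `*.mysite.com` covers `customer_a.mysite.com`, but not the bare `mysite.com` and not a deeper name like `a.b.mysite.com`.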

We automate the creation of our infrastructure using Terraform. Terraform is a leading product for managing infrastructure as code and you should consider using it if you’re not already (Really! It’s a life changer!). If you’re unfamiliar with it, stay tuned for a post about it that I plan to publish soon.

In Terraform, we use an awesome provider that runs the ACME protocol with many different CAs (including Let’s Encrypt) and many DNS services (including GCP Cloud DNS) and eventually produces a signed certificate for us.

So what was the problem?

So after all of that introduction, it’s time to finally debug our issue!

When we compared the certificates shown by two browsers (connected to the internet from different corners of the globe), we noticed that the problem was with Let’s Encrypt’s intermediate certificate, R3, which showed different expiration dates. But how can the same certificate appear to expire at two different times?

Two browsers — two results

The reason is that these are two different certificates. When we uploaded the certificate to our load balancer, we only uploaded our server certificate and not the full chain (which is composed of our certificate and the chain of signing certificates up to the root certificate).

When the server doesn’t send the full chain, the browser has to find the missing certificates and build the chain itself. Different browsers implement this mechanism differently. Some preload intermediate certificates, and some read the certificate’s AIA (Authority Information Access) extension, which contains a URL that points to the issuer’s certificate. For more details I recommend reading this post.
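If you want to see the AIA URLs for yourself, Python’s `ssl` module exposes them: the dictionary returned by `SSLSocket.getpeercert()` has a `caIssuers` entry when the certificate carries that extension. The helper name and the sample dictionary below are made up for illustration (the URL is a placeholder, not a real CA endpoint).

```python
from typing import Any, List, Mapping


def issuer_urls(peer_cert: Mapping[str, Any]) -> List[str]:
    """Return the AIA 'CA Issuers' URLs from a decoded certificate dict."""
    # getpeercert() only includes "caIssuers" when the extension is present,
    # so fall back to an empty list.
    return list(peer_cert.get("caIssuers", ()))


# Hypothetical, heavily trimmed example of what getpeercert() might return.
sample_cert = {
    "notAfter": "Dec 12 10:00:00 2021 GMT",
    "caIssuers": ("http://ca.example/issuer.der",),
}
```

A client that follows AIA would download the certificate from that URL and try it as the next link in the chain. A client that doesn’t has to rely on whatever intermediates it already knows about.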

But how is it possible that two different certificates signed our certificate?

The mechanism for it is called cross-signing. We first must understand that the CA doesn’t sign our certificate with its certificate but with a private key that only it has. The public key matching this private key is found inside the CA’s certificate, which allows the browser to verify that the CA is the signer. In order for Let’s Encrypt (or any other CA) to deal with the expiration of its own certificate (or its revocation for some reason), it can get a new certificate issued for the same identity and the same public key. Since the browser only checks our certificate’s signature against that public key, both the old and the new CA certificate can complete the chain.

The specifics of why one browser was able to retrieve the new certificate while the other one only showed the old certificate go beyond the debugging I did. It’s probably related to caching of the certificate URL somewhere along the way on the internet, but it can also be caused by an incomplete browser implementation that doesn’t look for additional certificates when there are multiple options.
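Here is a toy Python model of that idea. It does no real cryptography: key names are plain strings standing in for key pairs, and the certificate names and dates are hypothetical, picked only to mirror the shape of our incident.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ToyCert:
    subject: str         # who the certificate identifies
    issuer: str          # name of the signer
    public_key: str      # stand-in for the subject's public key
    signed_with: str     # stand-in for the key pair that produced the signature
    not_after: datetime  # expiration time


def candidate_issuers(cert: ToyCert, pool: List[ToyCert]) -> List[ToyCert]:
    """Every certificate in pool whose name and key match cert's signer."""
    return [c for c in pool
            if c.subject == cert.issuer and c.public_key == cert.signed_with]


def pick_valid_issuer(cert: ToyCert, pool: List[ToyCert],
                      now: datetime) -> Optional[ToyCert]:
    """Return the first matching issuer certificate that has not expired."""
    for candidate in candidate_issuers(cert, pool):
        if candidate.not_after > now:
            return candidate
    return None


# Hypothetical data: one leaf, and two certificates for the same CA key.
leaf = ToyCert("*.mysite.com", "R3", "leaf-key", "r3-key", datetime(2021, 12, 12))
old_r3 = ToyCert("R3", "Root A", "r3-key", "root-a-key", datetime(2021, 9, 29))
new_r3 = ToyCert("R3", "Root B", "r3-key", "root-b-key", datetime(2025, 9, 15))
now = datetime(2021, 10, 12)
```

With both intermediates available, `pick_valid_issuer(leaf, [old_r3, new_r3], now)` returns `new_r3`. A client that only knows `old_r3` gets `None`, which is roughly the situation of the users who saw an expired certificate.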

OK, we got it. And the fix is?

Since some clients weren’t able to build a valid certificate chain on their own, our fix was to have our load balancer serve the full certificate chain and not just our server certificate.

The fix required only a one-line change (Terraform rocks!):

Fix in Terraform — use full chain certificate
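Outside of Terraform, a “full chain” file is just the PEM blocks concatenated, with the server certificate first and the intermediates after it. The Python sketch below is one way to build such a file and to check how many certificates a PEM file contains before uploading it. The function names are my own, and the PEM bodies in the usage note are placeholders rather than real certificates.

```python
import re
from typing import List

# Matches one PEM-encoded certificate block, including its delimiters.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_bundle(bundle: str) -> List[str]:
    """Return each certificate block found in a PEM string, in order."""
    return PEM_BLOCK.findall(bundle)


def build_full_chain(leaf_pem: str, intermediate_pems: List[str]) -> str:
    """Concatenate the leaf first, followed by each intermediate."""
    parts = [leaf_pem.strip()] + [pem.strip() for pem in intermediate_pems]
    return "\n".join(parts) + "\n"
```

For example, with `leaf = "-----BEGIN CERTIFICATE-----\nLEAF\n-----END CERTIFICATE-----\n"` and a similar placeholder for the intermediate, `split_pem_bundle(leaf)` finds one block while `split_pem_bundle(build_full_chain(leaf, [intermediate]))` finds two. A file with a single block is a sign that only the server certificate is being uploaded.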

And that’s it! Problem solved :)

Please note that the solution presented might not be the best one. Even though in our case our certificate expires well before the intermediate certificate, there might be other scenarios in which the intermediate certificate becomes invalid, and serving the full chain may interfere with the browser’s logic for finding the most suitable chain. If you find a better solution, please let me know in the comments below.

Conclusions

Certificates are hard!

There are many moving parts (browser implementations, server caches, different CA implementations), but I hope this post helps uncover some of the mystery.

Naor Matania

Software Engineer, Ideas Explorer, Product builder — “The best way to predict the future is to create it” (A. Lincoln)