Cloud computing services such as AWS EC2 use SSH as the main authentication method for Linux servers. Even though SSH is a great choice of protocol, since it avoids the bureaucracy that is part of other authentication schemes, the key-management scheme currently used by cloud providers (largely copied from Amazon’s EC2) is not good enough.

The way it’s currently done is like this:

To authenticate to a machine, one needs a pem file containing a private key. The corresponding public key is made known to the EC2 service (either uploaded by the user or generated by the service itself). When provisioning a machine, the relevant pem file is chosen, which causes the service to place the public key inside the machine. After the sshd daemon starts, users holding the private key can log in to the machine.
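For illustration, a pem file is just a PEM-encoded private key. Generating a pair like the one EC2 hands out can be sketched with plain OpenSSH tools (the file name is an example, not anything EC2-specific):

```shell
# Generate an RSA key pair; the PEM-format private key stays with the user,
# while the .pub half is what EC2 plants in ~/.ssh/authorized_keys on the instance.
ssh-keygen -t rsa -b 2048 -f my-cluster-key -N '' -m PEM

# Logging in then looks like (address is a placeholder):
#   ssh -i my-cluster-key ec2-user@<instance-address>
```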

What’s wrong with this?

Managing the pem files (private keys) is problematic for several reasons:

1. Insecure key transfer. To give more than one user access to a running cluster, these pem files need to be transferred somehow. In my experience, users send these private key pem files over email or on shared storage available to the entire company. This is a big security hole: once a key is transferred, it’s very hard to keep track of, and there is no way to know who it has been exposed to.

2. Lost/stolen laptop. When someone loses a laptop containing private key pem files, it usually triggers a daunting risk-analysis process: everyone tries to figure out which pem files were exposed and how to revoke them. This is usually done under stress and with great uncertainty, and it disrupts everyone else who still needs access to the machines.

3. Lost pem files. This is rare but can happen: what do you do if a private key pem file is lost and you still need to log in to the machine?

The solution:

Use SSH with a Certificate Authority (CA) to validate clients. One of the coolest features of SSH is the ability to use a CA to issue client certificates. With this method, a developer who needs access to a machine generates a private/public key pair and sends only the public key to the administrator (notice that the private key never leaves the developer’s machine). The administrator signs the public key with the CA private key to produce a certificate (which contains no secret) and sends it back to the developer. The machines are configured to use the CA public key to validate credentials and to periodically update the revocation list. That’s it.
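The flow above can be sketched with plain OpenSSH tools (all file names, the identity, and the principal below are examples):

```shell
# Administrator: create the CA key pair once.
ssh-keygen -t ed25519 -f ca -N '' -C 'ssh-user-ca'

# Developer: generate a personal key pair; the private key stays local.
ssh-keygen -t ed25519 -f user-key -N '' -C 'developer'

# Administrator: sign the developer's PUBLIC key with the CA, producing
# user-key-cert.pub (contains no secret; sent back to the developer).
ssh-keygen -s ca -I developer@example.com -n ubuntu -V +52w user-key.pub

# Servers: trust the CA by pointing sshd at its public key, i.e. add to
# /etc/ssh/sshd_config:
#   TrustedUserCAKeys /etc/ssh/ca.pub
```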

BTW, please don’t confuse SSH PKI with X.509. SSH certificates are not X.509 certificates; SSH has its own PKI implementation for some reason.

The advantages are:

1. There is no need to send secrets over non-secure channels.

2. If a user loses a laptop, the affected keys are added to the revocation list; all other users are unaffected and need not even be aware of the incident. This method provides great confidence in minimizing the damage of such incidents.

3. A certificate can be issued with an expiration time, restricted to a specific host, etc., which also reduces the risk in case of a breach.
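Revocation (advantage 2) is handled with an OpenSSH Key Revocation List (KRL). A sketch, with example file names:

```shell
# Generate the key pair that will play the role of the "lost" key.
ssh-keygen -t ed25519 -f lost-key -N ''

# Create a KRL (use -u to update an existing one) containing the lost key.
ssh-keygen -k -f revoked_keys lost-key.pub

# Servers consume the list via sshd_config:
#   RevokedKeys /etc/ssh/revoked_keys

# Check: the key is now reported as revoked (non-zero exit status).
ssh-keygen -Q -f revoked_keys lost-key.pub || echo "lost-key is revoked"
```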

How to do it?

We chose to use HashiCorp Vault to store the CA private key, sign users’ keys, and generate certificates. As of today, Vault (version 0.8.2.1) is not a perfect solution for this, since it doesn’t manage the revocation list, but it’s still better than storing the CA private key in an insecure place with no auditing.

You can just follow the instructions in: https://www.vaultproject.io/docs/secrets/ssh/signed-ssh-certificates.html
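The flow in that guide looks roughly like this with the modern Vault CLI (the mount path, role name, and parameters below are examples; the exact command names may differ in the older Vault version mentioned above):

```shell
# Enable an SSH secrets engine dedicated to client-key signing.
vault secrets enable -path=ssh-client-signer ssh

# Have Vault generate and hold the CA key pair.
vault write ssh-client-signer/config/ca generate_signing_key=true

# A role allowed to sign user public keys.
vault write ssh-client-signer/roles/dev-role \
    key_type=ca \
    allow_user_certificates=true \
    allowed_users="ubuntu" \
    ttl=12h

# A developer asks Vault to sign their public key; the output is the certificate.
vault write -field=signed_key ssh-client-signer/sign/dev-role \
    public_key=@$HOME/.ssh/id_ed25519.pub > ~/.ssh/id_ed25519-cert.pub
```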

We found it useful to configure the servers in the following way:

We created an S3 bucket and a path to store the CA public key and the revocation list (neither contains a secret), created a read-only policy for this location, and attached it to the IAM roles we use. On every machine we added a service that launches on startup: it fetches the CA public key and polls the revocation list every few hours. It’s better to start this service before sshd starts, just to be on the safe side.
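A sketch of that boot-time service (the bucket name, paths, and interval are assumptions; reads rely on the instance’s IAM role):

```shell
#!/bin/sh
BUCKET="s3://example-ssh-ca-bucket"

# Fetch the CA public key once at startup...
aws s3 cp "$BUCKET/trusted-user-ca.pub" /etc/ssh/trusted-user-ca.pub

# ...then keep the revocation list fresh every few hours.
while true; do
    aws s3 cp "$BUCKET/revoked_keys" /etc/ssh/revoked_keys
    sleep 14400   # 4 hours
done

# sshd_config then points at both files:
#   TrustedUserCAKeys /etc/ssh/trusted-user-ca.pub
#   RevokedKeys       /etc/ssh/revoked_keys
```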

The CA public key can also be read from Vault, but we chose to serve it from S3 instead, both because Vault doesn’t manage the revocation list and because of a chicken-and-egg problem: launching Vault itself usually depends on SSH access in some way.

One small problem: we use Terraform to handle provisioning, and the Terraform (v0.9.2) connection stanza doesn’t know how to work with both a private key and a certificate. To work around this, use ssh-agent in the shell that will do the provisioning: add your private key and the certificate (using the ssh-add command) and remove the private key from the connection stanza. This causes the “terraform apply” command to use the agent instead of handling SSH by itself.
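The workaround can be sketched like this (the key name is a throwaway example; ssh-add picks up a certificate automatically when it sits next to the key as `<key>-cert.pub`):

```shell
# Start an agent in the provisioning shell.
eval "$(ssh-agent -s)"

# Stand-in for your real signed key pair.
ssh-keygen -t ed25519 -f demo-key -N ''

# Load the key (and demo-key-cert.pub, if present) into the agent.
ssh-add demo-key

# Verify the agent holds it.
ssh-add -l

# In the Terraform connection stanza, drop private_key and enable the agent:
#   connection {
#     type  = "ssh"
#     user  = "ubuntu"
#     agent = true
#   }
```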

Hardening the ssh daemon

We just followed the recommendations in: https://ef.gy/hardening-ssh
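For reference, the kind of sshd_config settings that guide recommends look like this (a sketch only; check each option against your OpenSSH version):

```
# /etc/ssh/sshd_config (excerpt)
PasswordAuthentication no
ChallengeResponseAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
```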

SSH ports must never be open to the outside world

SSH ports must never be exposed to the internet. This can be achieved with appropriate security groups whose rules block traffic to the SSH port and allow it only from your static IP address. If you need access from a dynamic IP address, then each time you want to log in you can adjust the security group of your instance(s) to allow the IP address you are currently using. Don’t forget to delete the rule when you are done.
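The temporary-rule dance for a dynamic IP can be scripted with the AWS CLI (the security-group id below is a placeholder):

```shell
MYIP="$(curl -s https://checkip.amazonaws.com)"

# Open SSH to the current IP only.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr "$MYIP/32"

# ...log in, do your work, then close the hole again.
aws ec2 revoke-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr "$MYIP/32"
```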

That’s it. I hope you will find this useful.