Follow @StolarekMarcin

Do you remember the first time you lost access to a server you manage? I remember mine: the server was in a location about 1 km from my office, so… I wasn’t very happy, but simply going to the data center and connecting an old-school PS/2 keyboard and VGA monitor worked fine. In the current era of virtualization and cloud services, this issue has a different flavor.

For a virtual machine running on any popular hypervisor, you can always open the virtual machine console. The situation is similar in AWS, where you can get access to the VM console through a web browser. Unfortunately, such a service is not available in Azure, so what other options do we have to restore access to an Azure VM? I’ll try to evaluate some options I was able to find on the Internet.

One of the issues you may face is simply a forgotten IP address. In this case you can check the IP addresses with the following command:

$ az vm list-ip-addresses -o table
VirtualMachine    PublicIPAddresses    PrivateIPAddresses
----------------  -------------------  --------------------
myVM1             52.166.38.246        172.31.11.4
myVM2             168.63.125.198       172.31.11.2
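If you only need the public address of one VM, for example to pass it straight to ssh, the table can be narrowed down with the CLI’s generic --query (JMESPath) and -o tsv options. This is a minimal sketch: the helper name vm_public_ip is hypothetical, and the query path assumes the JSON layout that az vm list-ip-addresses -o json prints (an array of objects with virtualMachine.network.publicIpAddresses[].ipAddress), so check it against your azure-cli version.

```shell
# Hypothetical helper: print the first public IP of a single VM.
# Usage: vm_public_ip <resource-group> <vm-name>
# Assumes the JSON layout of `az vm list-ip-addresses -o json`;
# adjust the --query path if your azure-cli version differs.
vm_public_ip() {
  az vm list-ip-addresses -g "$1" -n "$2" \
    --query "[0].virtualMachine.network.publicIpAddresses[0].ipAddress" \
    -o tsv
}
```

Usage could then look like: ssh cinek@"$(vm_public_ip testVM myVM1)"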

If you know the IP address but don’t know the username or password, you may find this command helpful:

$ az vm user update -n testVM -g testVM -u cinek

This creates the user cinek if no such user exists on the VM, and adds the current user’s public key (from ~/.ssh) to authorized_keys. You can also specify a different public key with --ssh-key-value, or set a password with -p newPassword. The command took a few seconds, after which the SSH keys were updated and I was able to log in.
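To push a specific key (say, one generated only for recovery) instead of the default one from ~/.ssh, the same command takes --ssh-key-value; according to the CLI help it accepts either the key text or the path to a .pub file, and appends the key rather than replacing existing ones. A small sketch, where the helper name push_vm_key is hypothetical; it only checks that the key file is readable before handing its path to az, so a typo fails immediately instead of after a round trip to the extension.

```shell
# Hypothetical helper: add/update USER on the VM with a given public key.
# Usage: push_vm_key <resource-group> <vm-name> <user> <public-key-file>
push_vm_key() {
  local rg="$1" vm="$2" user="$3" pubkey="$4"
  # Fail early if the key file is missing or unreadable.
  if [ ! -r "$pubkey" ]; then
    echo "push_vm_key: cannot read public key file: $pubkey" >&2
    return 1
  fi
  # --ssh-key-value takes the key text or a path to the .pub file.
  az vm user update -g "$rg" -n "$vm" -u "$user" --ssh-key-value "$pubkey"
}
```

For example: push_vm_key testVM testVM cinek ~/.ssh/azure_recovery.pub (the key path is just an illustration).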

Another issue may be a misconfigured sshd, either disallowing access or making the service fail on start because of a syntax error. In this case you can restore the default SSH daemon configuration with:

$ az vm user reset-ssh -n testVM -g testVM

Both az vm user commands rely on the VMAccess virtual machine extension [1]. On Linux, extensions are handled by the Azure Linux Agent, which runs as a service called waagent. Let’s check what happens if we stop the service, with a simple:

$ systemctl stop waagent.service

And then run:

$ az vm user reset-ssh -n testVM -g testVM
| Running ..

Surprisingly, this left the command running for a very long time. I almost wrote “forever”, but after a few hours I noticed an error message in my tmux window:

Deployment failed. Correlation ID: cbe501dd-1753-4a32-b75e-60d0f60068d9. Provisioning of VM extension 'enablevmaccess' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

I then repeated the same command and it completed successfully, but it didn’t change my /etc/ssh/sshd_config, which I had broken on purpose for the test. I didn’t investigate this further, since the command’s functionality is very specific. What if, for instance, we filter out SSH traffic on the firewall? There is a way: we can execute a command on the server without logging in, with the help of:

$ az vm run-command invoke -g testVM -n testVM --command-id RunShellScript --scripts "id"
{
  "endTime": "2018-04-22T15:35:36.523217+00:00",
  "error": null,
  "name": "e1f09a9d-2261-4556-a6ee-adcd20a56a43",
  "output": [
    {
      "code": "ProvisioningState/succeeded",
      "displayStatus": "Provisioning succeeded",
      "level": "Info",
      "message": "Enable succeeded:
[stdout]
uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:unconfined_service_t:s0

[stderr]
"
    }
  ],
  "startTime": "2018-04-22T15:33:53.022994+00:00",
  "status": "Succeeded"
}

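The useful part of that response is buried in the message field, between the [stdout] and [stderr] markers. A minimal sketch for cutting it out, assuming the marker layout shown in the output above; the function name run_command_stdout is hypothetical.

```shell
# Hypothetical helper: read a run-command "message" on stdin and print only
# the lines between the [stdout] and [stderr] markers.
run_command_stdout() {
  awk '
    /^\[stdout\]/ { in_out = 1; next }  # start after the marker line
    /^\[stderr\]/ { in_out = 0 }        # stop at the stderr section
    in_out                              # print lines while inside stdout
  '
}
```

It could be fed with something like az vm run-command invoke -g testVM -n testVM --command-id RunShellScript --scripts "id" --query "output[0].message" -o tsv | run_command_stdout. The query path follows the 2018 output above; newer azure-cli versions seem to return the same data under value instead of output, so adjust it to whatever your version prints.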
Let’s try a simple accidental “ifconfig eth0 down” case and try to recover from it with:

$ az vm run-command invoke -g testVM -n testVM --command-id RunShellScript --scripts "ifconfig eth0 up"
- Running ..
\ Running ..
\ Running ..
- Running ..
- Running ..
Get Token request returned http error: 400 and server response: {"error":"interaction_required","error_description":"AADSTS50076: Due to a configuration change made by your administrator, or because you moved to a new location, you must use multi-factor authentication to access '797f4846-ba00-4fd7-ba43-dac1f8f63013'.\r
Trace ID: 5195b7d3-c07d-4642-b3fa-ed884ddb4700\r
Correlation ID: dfbec356-be5a-438d-8c6c-51ddff819f33\r
Timestamp: 2018-04-22 17:29:37Z","error_codes":[50076],"timestamp":"2018-04-22 17:29:37Z","trace_id":"5195b7d3-c07d-4642-b3fa-ed884ddb4700","correlation_id":"dfbec356-be5a-438d-8c6c-51ddff819f33","suberror":"basic_action"}

The error was displayed after more than an hour of “Running ..”, and afterwards I was unable to execute any az command, for instance:

$ az vm list
Get Token request returned http error: 400 and server response: {"error":"interaction_required","error_description":"AADSTS50076: Due to a configuration change made by your administrator, or because you moved to a new location, you must use multi-factor authentication to access '797f4846-ba00-4fd7-ba43-dac1f8f63013'.\r
Trace ID: ba855634-79dc-4a12-b1df-4a9f4c4d3b00\r
Correlation ID: 43536441-31f5-4611-8095-6afb52830495\r
Timestamp: 2018-04-22 17:42:19Z","error_codes":[50076],"timestamp":"2018-04-22 17:42:19Z","trace_id":"ba855634-79dc-4a12-b1df-4a9f4c4d3b00","correlation_id":"43536441-31f5-4611-8095-6afb52830495","suberror":"basic_action"}

When I published this post I didn’t understand what happened; however, it was my fault. The command was running for so long that I started working on something else, and I issued one of the az commands from a remote server while the other commands (for the article) were run from my PC. This triggered a session lock (the AADSTS50076 message above asks for multi-factor authentication after a change of location), which is good security practice from MS Azure! After a simple az logout; az login, azure-cli connectivity was restored.
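Before starting another long-running operation, it may be worth checking that the azure-cli session is still usable instead of discovering it an hour later. A minimal sketch, assuming that az account get-access-token, which has to obtain a token, fails in the same situation as the commands above; the function names az_session_ok and ensure_az_login are hypothetical.

```shell
# Hypothetical helpers: check the azure-cli session and log in again if needed.
# Assumption: get-access-token fails when the cached credentials can no
# longer be used without interaction (e.g. the AADSTS50076 MFA prompt above).
az_session_ok() {
  az account get-access-token >/dev/null 2>&1
}

ensure_az_login() {
  if ! az_session_ok; then
    echo "azure-cli session is not usable, logging in again" >&2
    # Ignore a failed logout (e.g. nothing to log out of) and log in anyway.
    az logout 2>/dev/null || true
    az login
  fi
}
```

az login is interactive, so this fits a terminal session better than an unattended script.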

OK, let’s repeat the test. First, check that command execution works:



$ az vm run-command invoke -g testGroup -n testVM --command-id RunShellScript --scripts "ifconfig eth0"
- Running ..
{
  "endTime": "2018-04-29T08:07:17.271407+00:00",
  "error": null,
  "name": "1f8a174b-7015-497e-b7ef-09039ca7b532",
  "output": [
    {
      "code": "ProvisioningState/succeeded",
      "displayStatus": "Provisioning succeeded",
      "level": "Info",
      "message": "Enable succeeded:
[stdout]
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.30.16.13  netmask 255.255.255.128  broadcast 172.30.16.127
        inet6 fe80::20d:3aff:fe28:a2cf  prefixlen 64  scopeid 0x20<link>
        ether 00:0d:3a:28:a2:cf  txqueuelen 1000  (Ethernet)
        RX packets 3884232  bytes 4140284915 (3.8 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4783646  bytes 659655479 (629.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[stderr]
"
    }
  ],
  "startTime": "2018-04-29T08:05:29.123762+00:00",
  "status": "Succeeded"
}

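The address needed for the next step can also be pulled out of that output instead of being copied by hand. A sketch that assumes the newer net-tools ifconfig format shown above, where the address is the field right after inet (older versions print inet addr:... instead); the function name first_ipv4 is hypothetical.

```shell
# Hypothetical helper: read ifconfig output on stdin and print the first
# IPv4 address. Assumes the net-tools format shown above, where the
# address is the field right after "inet" (not "inet addr:...").
first_ipv4() {
  awk '$1 == "inet" { print $2; exit }'
}
```

For example: az vm run-command invoke -g testGroup -n testVM --command-id RunShellScript --scripts "ifconfig eth0" --query "output[0].message" -o tsv | first_ipv4. The query path follows the output above; newer azure-cli versions seem to use value instead of output.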


Log in to the VM and disable the interface:

$ ssh 172.30.16.13 'ifconfig eth0 down'

Then try to enable it again with az vm run-command:



$ az vm run-command invoke -g hpc-elk -n hpc-elk2 --command-id RunShellScript --scripts "ifconfig eth0 up"
- Running ..
- Running ..
\ Running ..
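Both az vm user reset-ssh and az vm run-command invoke ended up in this endless “Running ..” state when the agent could not answer. If you script these calls, it may be worth bounding the wait on the client side. A minimal sketch using coreutils timeout; the function name az_with_timeout and the AZ_TIMEOUT variable are hypothetical, and killing the CLI only stops the waiting: the operation already submitted to Azure is not cancelled by it.

```shell
# Hypothetical wrapper: run an az command, but give up after AZ_TIMEOUT
# seconds (default 900) instead of spinning for hours.
# Only the local CLI is stopped; the Azure-side operation keeps going.
az_with_timeout() {
  local limit="${AZ_TIMEOUT:-900}" rc
  timeout "$limit" az "$@"
  rc=$?
  # coreutils timeout exits with status 124 when the time limit was hit.
  if [ "$rc" -eq 124 ]; then
    echo "az $*: no result after ${limit}s, giving up" >&2
  fi
  return "$rc"
}
```

For example: AZ_TIMEOUT=600 az_with_timeout vm run-command invoke -g hpc-elk -n hpc-elk2 --command-id RunShellScript --scripts "ifconfig eth0 up"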

What can I say? I’m happy that I haven’t lost access to any of my production Azure VMs, since fixing it may not be that easy :). Of all the options tested above, vm run-command sounds like the best one for an azure-cli user, but it requires IP connectivity on the production interface, which is not a desirable service design. In a critical situation, the best idea may be to go to portal.azure.com and use the serial console connection available from the web: good for one VM, difficult to use for a bigger deployment.

[1] https://docs.microsoft.com/en-us/azure/virtual-machines/windows/extensions-features