How we build and operate Keboola data platform
Ondřej Popelka 8 min read

Automating Self-Hosted Pipelines: Part 2

Building an Azure DevOps pipeline to build an Azure DevOps pipeline.

In the first part of the article, I described how I automate the worker virtual machine setup using a Bash script. Now I’ll describe how to automate that Bash script using Azure Pipelines.

Having gone through the first part of the journey, I now have a Bash script which takes two environment variables (a token and a worker name) and sets up a new Azure pipeline worker. That’s cool, but I still have to create the token, write instructions on how to run the script, store the script somewhere, run the script somewhere, etc. There is still some room for automation and repeatability.

More Automation

So, how about I build an Azure pipeline which creates an Azure pipeline worker?

Sure.

We have a dedicated Azure subscription for CI services (like a testing Kubernetes cluster, or resources needed to run tests of applications). That makes it convenient for this case too. I.e., I can authorize a pipeline to the CI subscription knowing that it won’t break anything except the CI in the worst case.

Let’s recap a bit. I have the virtual machine automated by an ARM template. I have the configuration automated by a bash script. I need to run that bash script on that machine.

An obvious solution is to open an SSH port, log in and run the script there. And that’s a big “no” for me, because it’s completely unnecessary for the worker to have any open ports to the outside world. In the rare case I need to visit the worker, the Serial console suffices.

Luckily, it’s possible to run scripts as part of the virtual machine deployment. The two most suitable solutions are Run Command and the Custom Script extension. Run Command can be used in this case, but the script has to be present on the target machine, which would be an extra moving part for me: I’d either have to create the command so that it self-downloads, or create a custom VM image (another moving part). The Custom Script extension takes care of this. Its one limitation is that the script cannot exceed 256 KB, which is more than plenty for me. The walkthrough is nice, and that’s pretty much all I needed.

Custom script extension can be placed as a sub-resource of the virtual machine. Roughly like this:

{
            "name": "[variables('resourceName')]",
            "type": "Microsoft.Compute/virtualMachines",
            "apiVersion": "2020-12-01",
            "location": "[resourceGroup().location]",
            "tags": "[variables('resourceTags')]",
            "dependsOn": [...],
            "properties": {
                "hardwareProfile": {
                    "vmSize": "Standard_D2s_v3"
                },
                ... snip ...
            },
            "resources": [
                {
                    "name": "[concat(variables('resourceName'), 'startup')]",
                    "type": "Extensions",
                    "location": "[resourceGroup().location]",
                    "apiVersion": "2019-03-01",
                    "tags": {
                        "displayName": "Startup script"
                    },
                    "properties": {
                        "publisher": "Microsoft.Azure.Extensions",
                        "type": "CustomScript",
                        "typeHandlerVersion": "2.1",
                        "protectedSettings": {
                            "commandToExecute": "[concat('bash test.sh ', parameters('patToken'), ' ', variables('resourceName'))]",
                            "fileUris": ["https://XXX/test.sh"]
                        }
                    }
                }
            ]
        },

But then the error messages are highly unreadable (everything mangled on one line). So, I resorted to running it separately using the az vm extension set command.

script_content=$(cat startup.sh | gzip -9 | base64 -w 0)
az vm extension set \
  --publisher "Microsoft.Azure.Extensions" \
  --name "CustomScript" \
  --version "2.0" \
  --ids "XXXXXX" \
  --protected-settings "{\"script\":\"$script_content\"}"

I also realized that I can directly pass the script code so it doesn’t have to be downloaded on the target machine. It just needs to be base64 encoded and optionally gzipped (also presented nicely in the official docs).
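Before sending the payload, it can be reassuring to check locally that it decodes back to the original script and stays within the size limit. The following is only a sketch: the function names are made up for illustration, and it assumes the 256 KB limit applies to the encoded value, which is how I read the docs.

```shell
# Hypothetical helpers; the names are illustrative and not part of any Azure tooling.

# Compress and base64-encode a script file the same way as the command above.
encode_script() {
  gzip -9 < "$1" | base64 -w 0
}

# Reverse the encoding: base64-decode stdin and decompress it.
decode_script() {
  base64 -d | gunzip
}

# Succeed only if the encoded payload is at most 256 KB
# (assumption: the limit is checked against the encoded string).
payload_fits() {
  [ "${#1}" -le $((256 * 1024)) ]
}

# Demo with a throwaway script.
tmp=$(mktemp)
printf '#!/bin/sh\necho "hello from the worker"\n' > "$tmp"
payload=$(encode_script "$tmp")
printf '%s' "$payload" | decode_script | cmp -s - "$tmp" && echo "round trip OK"
payload_fits "$payload" && echo "payload size OK"
rm -f "$tmp"
```

If the round trip fails locally, the extension would not be able to decode the script either, and finding that out here is much faster than finding it out in a deployment log.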

Setting up a pipeline

With the Custom Script extension I’m almost there; I just have to put the parts together. The worker can then be deployed using the following script (did I say I really try to avoid bash?):

az group create \
  --subscription "$SUBSCRIPTION" \
  --location "East US" \
  --name "$WORKER_NAME" \
  --tags "purpose=azure-pipeline" "workerName=$WORKER_NAME"

az deployment group create \
    --subscription "$SUBSCRIPTION" \
    --resource-group "$WORKER_NAME" \
    --name "$WORKER_NAME" \
    --template-file ./template.json \
    --parameters \
        adminUsername="$ADMIN_USERNAME" \
        adminPassword="$ADMIN_PASSWORD"

vmId=$(
  az deployment group show \
    --subscription "$SUBSCRIPTION" \
    --resource-group "$WORKER_NAME" \
    --name "$WORKER_NAME" \
    --query "properties.outputs.vmId.value" \
    --output tsv
)

script_content=$(cat startup.sh | gzip -9 | base64 -w 0)

printf "\nRunning Startup script"
az vm extension set \
  --publisher "Microsoft.Azure.Extensions" \
  --name "CustomScript" \
  --version "2.0" \
  --ids "$vmId" \
  --protected-settings "{\"script\":\"$script_content\"}"

The script creates a resource group and deploys the virtual machine (and its accessories). Then it gets the ID of the created machine, processes the script (gzip and base64-encode) and runs the Custom Script extension with it.

Creating a resource group for each worker is a little bit of overkill. I decided to do it so that I can remove the worker deployment easily. Unlike AWS CloudFormation, Azure deployments don’t delete the resources they created, which is hugely annoying. Luckily, I have a dedicated subscription available, so it’s not a huge issue.

Almost done.

Authentication

What remains is the authentication. I use a PAT (personal access token) to register the worker because the other authentication methods seem too complex. I started by creating my PAT in the UI and feeding it to the code through environment variables. This is annoying because the token should expire, and also because it’s my personal secret (of Arabia).

Luckily, it is also possible to use the authorization of the pipeline job itself. While the job authorization is quite complex with many options, I ended up using the built-in System.AccessToken variable, mapped into the script’s environment. The system access token authorizes the job as the corresponding “build service”. The build service initially has no access to the agent pool, so it is not able to add workers to it. This can be fixed by assigning the respective permissions in the agent pool (“Default” in my case).

I’m not jumping for joy about giving the job an administrative role, but the project in which the pipeline job runs is a separate project with only two pipelines, one to create and one to delete the worker. So I accepted it (at least until I get a managed identity up and running, or use the Personal access token API, which just became available).

Now the question is how to get the PAT to the script inside the machine. The custom script extension doesn’t have any variables. But luckily, some bash magic can solve it:

envsubst '${PAT_TOKEN} ${WORKER_NAME}' < startup.sh > startup-replaced.sh
script_content=$(cat startup-replaced.sh | gzip -9 | base64 -w 0)

That essentially bakes the token (and the worker name) into the script right before it is run. I’m sending the script content inside the protected-settings parameter of the vm extension set command, so it gets encrypted.

The thing to remember about envsubst is that if you don’t list the variables to substitute, it will completely cripple the bash script, because it’ll replace every variable reference in it (those not set in the environment become empty strings). Don’t even ask how long it took before that occurred to me.

With this I have another bash script ready (did I say I try to avoid bash?) that deploys the Azure resources and runs the initializing script on the machine (which installs and configures the agent). The final version can be seen here.

I can now put this into an azure-pipelines.yml definition:

pr: none
trigger: none
pool:
  vmImage: ubuntu-latest
stages:
  - stage: deploy
    displayName: 'Deploy Worker'
    jobs:
      - job: deployWorker
        steps:
          - task: AzureCLI@2
            inputs:
              azureSubscription: 'XXXXX'
              scriptType: 'bash'
              scriptLocation: 'scriptPath'
              scriptPath: 'run-startup.sh'
            env:
              SUBSCRIPTION: $(SUBSCRIPTION)
              WORKER_NAME: $(WORKER_NAME)
              ADMIN_USERNAME: $(ADMIN_USERNAME)
              ADMIN_PASSWORD: $(ADMIN_PASSWORD)
              PAT_TOKEN: $(System.AccessToken)

Then I just configure a new build pipeline with the above definition.
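Since all inputs reach the script only through the env mapping, a fail-fast check at the top of the deployment script can save a slow pipeline iteration. This is a sketch with a hypothetical helper name. It also treats a value that still looks like $(NAME) as missing, on the assumption (as far as I know) that Azure Pipelines leaves an undefined macro as literal text.

```shell
# Hypothetical helper: returns non-zero and names every missing variable.
require_env() {
  local missing=0 name value
  for name in "$@"; do
    # printenv only sees exported variables, which is what the env mapping sets.
    value=$(printenv "$name" || true)
    case "$value" in
      '' | '$('*)
        printf 'Missing required environment variable: %s\n' "$name" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Demo with a made-up value; in the real script the call would list
# SUBSCRIPTION WORKER_NAME ADMIN_USERNAME ADMIN_PASSWORD PAT_TOKEN.
export WORKER_NAME="pipeline-demo"
require_env WORKER_NAME && echo "environment looks complete"
```

Reporting all missing names at once, instead of stopping at the first one, means a single failed run is enough to fix the whole configuration.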

Hurdles

The VM setup bash script needs to restart the machine, because it changes GRUB. But restarting the machine needs to be done “asynchronously”, so that the custom script has a chance to finish successfully (and consequently, the deployment finishes successfully). A one-minute delay is sufficient:

sudo shutdown -r +1 "Rebooting."

Without the delay the deployment fails (and yet everything is set up correctly). That took me a while to figure out. The delay means that the pipeline worker behaves slightly strangely — it’s up when the pipeline finishes, then it goes down and up again. That could be a problem if something else was hooked to the end of the pipeline run (e.g., starting tests on that worker), but I’m not doing that.

Having the custom script as a sub-resource of the VM resource seems convenient at first, but the errors are really unreadable. Often I needed to go to the Azure Portal UI to read the logs, which are still not the hallmark of readability.

Also, when the script is part of the VM resource, it takes much longer to execute the deployment (logically, because all the VM properties are checked). Having the custom script run separately from the VM template proved to be useful for faster iterations.

I also added a couple of safeguards to the script. The worker name (and resource group name) is entered as a user-overridable parameter of the pipeline, so I added a check:

if [[ "$WORKER_NAME" != "pipeline-"* ]] ;
then
   printf "'%s' is not a valid worker name (must match 'pipeline-*')." "$WORKER_NAME"
   exit 1
fi

This makes sure that the pipeline does not touch some random resource group and inadvertently change something there. I also added a check for an already running worker to the setup script:

if [[ -d "/datadrive" ]]
then
    echo "Worker is already running"
    exit 1
fi

This simply checks whether my custom mount directory exists. While primitive, it takes care of many possible failure scenarios. Generally, it allows retrying the VM creation part and avoids retrying anything that breaks when the machine is already running.
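Strictly speaking, the -d test only tells me that the directory exists, not that anything is mounted on it. If that distinction ever matters, a stricter variant could use mountpoint from util-linux. This is a sketch under the assumption that /datadrive is a real mount point; the function name is made up.

```shell
# Hypothetical stricter guard: true only if the path is an active mount point.
is_mounted() {
  mountpoint -q "$1"
}

# In the setup script this could replace the -d test:
#   if is_mounted /datadrive; then echo "Worker is already running"; exit 1; fi

# Demo: a freshly created directory exists but has nothing mounted on it.
plain_dir=$(mktemp -d)
if [ -d "$plain_dir" ] && ! is_mounted "$plain_dir"; then
  echo "directory exists, but nothing is mounted on it"
fi
rmdir "$plain_dir"
```

For my purposes the simple check is enough, because the directory only ever appears as part of the setup.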

And last: even short scripts inside the pipeline YAML are a huge pain. There is no way to reasonably debug the script. The printf command doesn’t seem to work properly (echo does). In either case, any mistake means triggering the pipeline again, waiting, and seeing some other awkward error, so each debugging iteration is really slow. The best solution I found was to wrap everything in bash scripts (did I mention I try to avoid coding in bash?).

Shutdown

While I was at it, I also created a pipeline to remove a worker. This is pretty easy — use a Custom script extension to run a bash script (did I mention that I really try to avoid bash scripts?) that uninstalls the agent service and deregisters the worker.

#!/usr/bin/env bash
set -Eeuo pipefail

printf "\nUninstall Agent\n"
cd /home/testadmin/azagent

# Intentionally allow both steps to fail.
sudo ./svc.sh uninstall || true

# PAT_TOKEN is substituted by envsubst before the script is uploaded.
printf 'cd /home/testadmin/azagent && ./config.sh remove --unattended --auth pat --token %s\n' "$PAT_TOKEN" > ./wrap.sh
sudo chmod a+x ./wrap.sh
runuser -l testadmin -c '/home/testadmin/azagent/wrap.sh' || true

printf "\nFinished successfully\n"

Intentionally, I allow both things to fail. The worst that can happen is that I’ll be left with a dangling worker in the agent pool.
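With set -e active, any step that is allowed to fail has to be guarded explicitly, otherwise the script stops at the first non-zero exit code. A small wrapper makes that intent visible and leaves a trace in the log. This is only a sketch; the helper name is made up.

```shell
# Hypothetical helper: run a command, log a failure, but never abort the script.
best_effort() {
  "$@" || printf 'Ignoring failure of: %s\n' "$*" >&2
}

# Demo in a subshell so that set -e does not leak into the caller.
(
  set -e
  best_effort false        # fails, gets logged, execution continues
  echo "still running"
)
```

Compared to a bare "|| true", the failure is at least recorded, which helps when a dangling worker shows up in the agent pool later.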

Then I need to run this script in the VM. For that I can again use the Custom Script extension. As before, it’s useful to run the custom script separately from the VM resource itself.

vmId=$(
  az deployment group show \
    --name "$WORKER_NAME" \
    --resource-group "$WORKER_NAME" \
    --subscription "$SUBSCRIPTION" \
    --query "properties.outputs.vmId.value" \
    --output tsv
)

envsubst '${PAT_TOKEN}' < shutdown.sh > shutdown-replaced.sh
script_content=$(cat shutdown-replaced.sh | gzip -9 | base64 -w 0)

az vm extension set \
  --publisher "Microsoft.Azure.Extensions" \
  --name "CustomScript" \
  --version "2.0" \
  --ids "$vmId" \
  --protected-settings "{\"script\":\"$script_content\"}"

printf "Deleting deployment %s\n" "$WORKER_NAME"
az group delete \
  --name "$WORKER_NAME" \
  --subscription "$SUBSCRIPTION" \
  --yes

The bash script takes the worker name and PAT token environment variables, runs the custom script to deregister the agent and then deletes the entire resource group. Then I can run this bash script from a build pipeline. It’s very similar to the startup pipeline. All the code is available in the repository.

The End

The remove pipeline requires entering a worker name, so I made that an empty variable with user override allowed. I have to set it before running the pipeline (the script checks for it).

The create pipeline has a default value, so it is now a matter of one click to create a self-hosted pipeline worker.
