
Silence is Golden: Filtering GitHub Notifications the Hard Way

At Keboola, we are slowly moving away from Azure DevOps for our CI/CD needs, with GitOps taking over the CD part and GitHub Actions the CI one. Apart from the actual application side of things, we also deploy all of our infrastructure using IaC, whether it's ARM, CloudFormation, or Terraform. And thanks to the GitOps effort, we moved the infrastructure pipeline applying Terraform code for all our GCP resources to GitHub Actions. It's superior to Azure DevOps in every single possible way.

Except the notifications.

We're running dozens of infrastructure pipelines daily, so to cut out the noise, we want to be notified in Slack only when something fails. In Azure DevOps, that is fairly simple to set up, but oddly enough, not in GitHub. With the standard Slack integration, you'll get notifications for everything, with very few filtering options available. And no status filter, of course. There was a now-deleted issue, more than two years old, with hundreds of thumbs up, but that was about it. There are other ways to do it, but they mean letting some third party read your GitHub contents, plus they aren't exactly free.

So why not do it yourself? How hard can it be?

The plan

With the problem clearly defined, I needed a strategy that balanced simplicity with effectiveness.

I started small. All I wanted was to get a Slack message into a public channel whenever a pipeline failed in our infrastructure repository. It would check pipelines on the default branch only, and it would include a link to get me quickly to where the problem is. No configuration, one repo, one branch, one Slack channel. The application itself should be as lightweight and as cheap to run as possible. I don't want to babysit it at all. Fire and forget.

I don't care much about performance; I need something easily maintainable and understandable.

And how do I make this app check my pipelines? The repositories I care about are private, requiring some kind of authentication to inspect them. I don't want to have a Personal Access Token lying around somewhere on the Internet, or to create a GitHub App just for that.

Good ol' webhooks! You can tell your repository to send a webhook to an address whenever a pipeline finishes. The payload is fairly extensive, but there's the same problem as with the official Slack integration: it always sends the webhook, whether the pipeline succeeded, failed, or was cancelled.

So I laid out fairly straightforward plan:

  1. Create a Python script that parses this payload, forms it into a nice Slack message, and sends it to Slack.
  2. Have this Python script run as a Flask app in Google Cloud Functions.
  3. Implement a proper CI/CD.

The journey

Time to get my hands dirty with the actual implementation.

The first hurdle was getting the actual payload of the webhook. GitHub provides reference docs, but I couldn't find a complete example anywhere. I could have hosted a dummy webhook receiver publicly somewhere, but you can also do it halfway: create a repository webhook with all the scopes you want, and put some random address as the receiver. The actual delivery will fail, but you'll be able to inspect the entire payload in the repository Settings.

Once I had my 350-line-long JSON, I started slicing it up. I was interested in:

  • The pipeline (Was it GCP infrastructure? Was it the AWS infrastructure pipeline?)
  • The job (Which step of the pipeline failed?)
  • The commit (What changed?)
  • The environment (Was it prod? Was it dev?)
  • Duration (How long did the pipeline run before failing?)
  • Links to the repository, job, pipeline, commit, and environment

And most importantly, I was interested only in payloads carrying failed pipelines.

Did I mention that GitHub also lets you know when a pipeline is queued or in progress? Yeah, it does that. So you need to check only for completed pipelines.

So I landed on these conditions to process the payload:

# every check a payload must pass before we bother Slack
conditions = [
    event_data["action"] == "completed",
    event_data["conclusion"] == "failure",
    default_branch_check,
    repository_owner == "keboola",
]

This list is evaluated every time a payload is received, and only if every condition is true does it send a Slack message.
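
Wired together, it's just one all() call; a minimal sketch, with build_slack_message and send_slack_message being hypothetical helpers for the formatting and the API call shown later:

# Only a payload that passes every single check results in a message;
# queued, in-progress, and successful runs are silently dropped.
if all(conditions):
    message = build_slack_message(event_data)      # hypothetical formatting helper
    send_slack_message(slack_channel_id, message)  # hypothetical wrapper around the Slack call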

What is default_branch_check? GitHub doesn't tell you whether the pipeline ran on the default branch of your repository. It tells you which branch it ran on and which branch is the default one, and you need to match them yourself.
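
A minimal sketch of that comparison, assuming the usual workflow_run payload shape (the exact field names here are my assumption, so double-check them against your own payload):

# The branch the run executed on and the repository's default branch
# both live in the payload; the check is a plain string comparison.
workflow_run = event_data.get("workflow_run", {})
repository = event_data.get("repository", {})

default_branch_check = (
    workflow_run.get("head_branch") == repository.get("default_branch")
)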

Checking repository_owner is just a precaution - the URL of the webhook receiver is not really guessable, but it is still public, so this would at least block someone nefarious from spamming our Slack.

What you can also (and should) do is provide an optional secret when configuring the webhook in GitHub. GitHub uses it to compute a signature that is sent in a header with every payload, and you can validate it on your side.
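
Validating it is a standard HMAC comparison against the X-Hub-Signature-256 header; roughly like this, assuming a Flask-style request object:

import hashlib
import hmac


def verify_github_signature(secret: str, request) -> bool:
    """Compare the X-Hub-Signature-256 header with an HMAC of the raw request body."""
    received = request.headers.get("X-Hub-Signature-256", "")
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), request.get_data(), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, received)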

Now I had everything I needed, except duration, which the JSON payload doesn't provide. But it does provide started_at and completed_at, so it took just a little fun with time math. There was more of this: there's no link to the commit, but you have the commit itself and the repository name, so you just glue them together; the same goes for the link to an environment.
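
Both are a couple of lines each; a sketch assuming the timestamps are ISO 8601 strings and that the repository URL and commit SHA come straight from the payload (variable names here are placeholders):

from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    # GitHub timestamps end with "Z"; older Pythons need an explicit offset for fromisoformat
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


duration = parse_timestamp(completed_at) - parse_timestamp(started_at)
seconds = int(duration.total_seconds())
duration_text = f"{seconds // 60}m {seconds % 60}s"

# There is no commit link in the payload, but repository URL + SHA is all a link needs.
commit_url = f"{repository_url}/commit/{commit_sha}"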

The hard part

Having the logic sorted out was just the beginning - now came the infrastructure challenges.

Now I had a Python script that could take a GitHub payload, check its validity, check whether it was for a failed pipeline, and parse out the information I cared about. How do I get that into my team's Slack channel now?

I need the script running somewhere to receive the webhooks, and I need a Slack integration of my own to send proper messages.

For Slack, I created a simple app that serves no real purpose other than being a gateway into our company's Slack. You get a token that belongs to this app, which you can use to call the Slack API.

headers = {
    "Content-Type": "application/json; charset=utf-8",
    "Authorization": f"Bearer {SLACK_BOT_TOKEN}",
}
data = {"channel": slack_channel_id, "text": message}
response = requests.post(
    "https://slack.com/api/chat.postMessage",
    headers=headers,
    json=data,
    timeout=60,
)

For running the Python script, I made it into a Google Cloud Function. You basically need to wrap it in a Flask app and you're good to go. I wanted it as simple as possible: no Docker, no nothing, just serve this code, please.
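
The wrapping really is just a decorated entry point; a minimal sketch using the functions-framework package (verify_github_signature and process_event stand in for the signature validation and the conditions check above):

import functions_framework


@functions_framework.http
def github_webhook(request):
    # Cloud Functions hands every incoming webhook to us as a Flask request object.
    if not verify_github_signature(GITHUB_WEBHOOK_SECRET, request):
        return ("Invalid signature", 401)

    event_data = request.get_json(silent=True) or {}
    process_event(event_data)  # the conditions check plus the Slack call
    return ("", 204)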

That opened a can of worms though, since I had to provision the whole thing through a pipeline. So I did what I know best and implemented it all using Terraform. You need a Google Storage Bucket for the code, the Google Cloud Function itself, and a whole bunch of IAM bindings around it (my favorite part is that to make the Function publicly accessible, i.e. invokable, you need to grant the roles/run.invoker role to literally allUsers).

resource "google_cloud_run_service_iam_member" "public_access" {
  location = google_cloudfunctions2_function.github_actions_slack_app_function.location
  service  = google_cloudfunctions2_function.github_actions_slack_app_function.name
  role     = "roles/run.invoker"
  member   = "allUsers"
}

If you've been paying attention, you might have realised that I actually have two secrets to worry about: the Slack token to authorize all my Slack API calls, and the GitHub webhook secret to validate the payloads. These need to be loaded by the Python script, so I simply put them manually into Google Secret Manager, granted the Service Account running the Function the secretmanager.secretAccessor role, and had the Python code fetch them through the Google API:

from google.cloud import secretmanager


def get_secret(project_id: str, secret_name: str) -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"

    try:
        response = client.access_secret_version(name=name)
        return response.payload.data.decode("UTF-8")
    except Exception as e:
        raise ValueError(f"Error retrieving secret {secret_name}: {str(e)}") from e

To complement all this, a GitHub pipeline using OIDC simply runs the Terraform in the target GCP project and deploys it in seconds. No real magic there.

The ride

And that's pretty much it. It runs on absolutely minimal resources, costs us around USD 0.30 a month, and shows us exactly what we want! And we can always change it! Perfect!

The Python app now also supports overriding the default branch if you want to monitor a different one, and it can load a config file with repositories and branches to monitor, and which channel to message for each of them. This is still managed through the repo itself, where the config JSON lives, and people can simply open a PR with their repository and Slack channel. The pipeline takes care of the rest.
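
The lookup itself is tiny; a sketch of what it can look like (the config layout here is illustrative, not our exact schema):

import json


def resolve_slack_channel(config_path: str, repo_full_name: str, branch: str) -> str | None:
    """Return the Slack channel ID configured for a repository/branch pair, if any."""
    with open(config_path, encoding="utf-8") as config_file:
        config = json.load(config_file)

    for entry in config.get("repositories", []):
        if entry["name"] == repo_full_name and branch in entry.get("branches", []):
            return entry["slack_channel_id"]
    return None  # unknown repo or branch: stay silent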

What I learned: Sometimes the best solutions are the simple ones you build yourself. While third-party integrations exist, creating a custom solution gave us exactly the control and filtering we needed without compromising security or introducing yet another subscription fee.

Would I do it again? Absolutely. The entire project took maybe a day to implement and has saved us countless hours of notification noise. It's been rock solid ever since, secure (it only receives and isn't authorized against GitHub in any way), and now I just take it for granted as it quietly chugs along in its own GCP project.

If you liked this article please share it.
