Reactive VM Monitoring in GCP

 
At Scalabit, we develop and maintain both the application logic and the Google Cloud Platform infrastructure for a service that enables one of our clients to run all their system regression tests at scale. This service currently handles the lifecycle of 250,000+ ephemeral Compute Engine instances per day, so it is of paramount importance that every instance is in a valid state – a running instance that is not actively processing a test is wasted compute (and money).
 
With this in mind, we have a monitoring setup that allows us to detect when a given instance is in an unwanted/invalid state, and either automatically recover it (if possible) or preemptively kill the instance to prevent wasting money.
 
In this article, we’ll share this simple but highly scalable setup for reactive monitoring in GCP, provide examples of unwanted behavior that can be detected, and show how to react to it automatically.
 

🏭 Infrastructure setup


 
On the topic of infrastructure, we leverage Cloud Monitoring Alerting Policies, Pub/Sub and CloudRun, as can be seen below. 
 
 
Monitoring Diagram
 
The interaction between the services is the following:
  • the service we wish to monitor (in this case Compute Engine instances) produces metrics that get pushed to Google Cloud Monitoring
  • we define alerting policies over those metrics, and setup a notification channel that pushes messages to a Pub/Sub topic whenever an alert is triggered
  • finally, we have a CloudRun service subscribed to the Pub/Sub topic, that will process the alert message and react accordingly
 
Throughout this article, all examples will be based on monitoring Compute Engine instances, but the same setup can be leveraged for any other service that produces Cloud Monitoring metrics.
 

🏗 Terraform Snippets


Here at Scalabit we refuse to create infrastructure manually, so let’s have a look at how we can set this up using Terraform.
 
First, let’s create our CloudRun service, as well as a dedicated service account for it:
 
resource "google_service_account" "cloudrun_service_account" {
  project      = var.project_id
  account_id   = "cloudrun-sva"
  display_name = "Service account for cloud run service"
}

resource "google_cloud_run_service" "alert_handler_cloudrun" {
  project  = var.project_id
  name     = "alert-handler-cr"
  location = "europe-west4"

  template {
    spec {
      service_account_name = google_service_account.cloudrun_service_account.email
      containers {
        image = var.container_image
        env {
          name  = "GCP_PROJECT_ID"
          value = var.project_id
        }
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}
 
Ignore the container image for now – we’ll go over it in a later section. Next, let’s create the Pub/Sub topic that will receive the alert messages:
 
resource "google_pubsub_topic" "monitoring_alerts_topic" {
  project = var.project_id
  name    = "monitoring-alerts-topic"
}
 
Finally, let’s create the subscription that will push the messages from the topic to our CloudRun service. We’ll also add a service account with invoker permissions on the CloudRun service, and assign it to the subscription so that it has the required permissions.
 
resource "google_service_account" "cloudrun_invoker_account" {
  project      = var.project_id
  account_id   = "cloudrun-invoker-sva"
  display_name = "Pubsub cloudrun invoker"
}

resource "google_cloud_run_service_iam_member" "invoker" {
  project  = var.project_id
  service  = google_cloud_run_service.alert_handler_cloudrun.name
  location = google_cloud_run_service.alert_handler_cloudrun.location
  role     = "roles/run.invoker"
  member   = "serviceAccount:${google_service_account.cloudrun_invoker_account.email}"
}

resource "google_pubsub_subscription" "monitoring_alerts_subscription" {
  project = var.project_id
  name    = "monitoring-alerts-subscription"
  topic   = google_pubsub_topic.monitoring_alerts_topic.name

  push_config {
    push_endpoint = google_cloud_run_service.alert_handler_cloudrun.status[0].url

    oidc_token {
      service_account_email = google_service_account.cloudrun_invoker_account.email
    }
  }
}
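One more piece is needed to glue Cloud Monitoring to the topic: the alerting policies in the next section reference a `google_monitoring_notification_channel` resource named `alert_pubsub_topic`. A minimal definition could look like the following sketch (the display name is our choice; verify the label key against the provider documentation for your version):

```hcl
# Pub/Sub-backed notification channel; the alerting policies reference its id.
resource "google_monitoring_notification_channel" "alert_pubsub_topic" {
  project      = var.project_id
  display_name = "Monitoring Alerts Pub/Sub Channel"
  type         = "pubsub"

  labels = {
    topic = google_pubsub_topic.monitoring_alerts_topic.id
  }
}
```

Keep in mind that Cloud Monitoring publishes through its own service agent, so that account also needs publish permissions (`roles/pubsub.publisher`) on the topic.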

📝 Example Alerting Policies


Now that we have the infrastructure setup, let’s take a look at a few alerting policies with different levels of complexity.
 

🧠 1. Reacting on CPU Overload

If you have the Google Ops Agent configured on your Compute Engine instances, you’ll get (amongst various other things) metrics regarding CPU utilization. Combining these with the default instance uptime metrics, we can define a policy that sends an alert whenever an instance’s CPU usage stays above 95% for more than 5 minutes in a row, and ensure it only applies to instances that have been running for longer than 10 minutes. This can be done with the following Terraform snippet:
 
resource "google_monitoring_alert_policy" "cpu_overload_alert" {
  project      = var.project_id
  display_name = "Too much CPU usage Alert Policy"
  combiner     = "OR"

  conditions {
    display_name = "CPU usage above threshold for 5 minutes"

    condition_prometheus_query_language {
      query               = <<-EOT
        (min_over_time(compute_googleapis_com:instance_cpu_utilization{monitored_resource="gce_instance"}[5m]) > 0.95)
        and on(instance_id)
        (compute_googleapis_com:instance_uptime_total{monitored_resource="gce_instance"} > 600)
      EOT
      evaluation_interval = "30s"
    }
  }

  alert_strategy {
    auto_close = "180s"
  }

  notification_channels = [
    google_monitoring_notification_channel.alert_pubsub_topic.id
  ]
}

⏳ 2. Reacting on Killed Processes

Once again, this requires the Google Ops Agent on your Compute Engine instances, as it provides metrics about the processes running inside each instance. We can define an alerting policy that sends an alert whenever a process called `antivirus` has **not** been running for more than 3 minutes inside a given instance.
 
resource "google_monitoring_alert_policy" "process_not_running" {
  project      = var.project_id
  display_name = "Antivirus not running Alert Policy"
  combiner     = "OR"

  conditions {
    display_name = "antivirus process not running for 3 minutes"

    condition_prometheus_query_language {
      query               = <<-EOT
        (max_over_time(compute_googleapis_com:instance_uptime_total{monitored_resource="gce_instance"}[3m]) > 180)
        unless on(instance_id)
        (sum_over_time(agent_googleapis_com:processes_cpu_time{monitored_resource="gce_instance",command="antivirus"}[3m]) > 0)
      EOT
      evaluation_interval = "30s"
    }
  }

  alert_strategy {
    auto_close = "180s"
  }

  notification_channels = [
    google_monitoring_notification_channel.alert_pubsub_topic.id
  ]
}

🚑 3. Reacting on Unauthorized Operations

Now this one is a bit more complex.
 
First, we need Tetragon running on our Compute Engine instances, configured with a policy that blocks access to a given file:
 
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "block-file-open"
spec:
  kprobes:
    - call: "security_file_open"
      syscall: false
      args:
        - index: 0
          type: file
      returnArg:
        index: 0
        type: file
      selectors:
        - matchArgs:
            - index: 0
              operator: "Equal"
              values:
                - "/DO_NOT_OPEN"
          matchActions:
            - action: Sigkill

We’ll not go in-depth into Tetragon here, as we’ll cover it in a separate Scalabit article. What you need to know for now is that with the above policy, Tetragon will block any attempt at opening the file `/DO_NOT_OPEN`, and that it exposes a gRPC server that allows you to monitor all Tetragon events.
 
With this, we can create a small service that listens for these events and sends a custom Google Cloud Monitoring metric whenever one is detected:
 
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	monitoringpb "cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	tetragon "github.com/cilium/tetragon/api/v1/tetragon"
	metricpb "google.golang.org/genproto/googleapis/api/metric"
	"google.golang.org/genproto/googleapis/api/monitoredres"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	metricsClient, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		log.Println("failed to initialize metrics client: " + err.Error())
		os.Exit(2)
	}
	defer metricsClient.Close()

	// Connect to the local Tetragon gRPC server through its unix socket.
	conn, err := grpc.NewClient("unix:///var/run/tetragon/tetragon.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Println("failed to connect to tetragon: " + err.Error())
		os.Exit(2)
	}

	client := tetragon.NewFineGuidanceSensorsClient(conn)
	stream, err := client.GetEvents(ctx, &tetragon.GetEventsRequest{})
	if err != nil {
		log.Println("failed to open event stream: " + err.Error())
		os.Exit(2)
	}

	// instanceId, instanceName and instanceZone are assumed to have been
	// fetched from the GCE metadata server at startup.
	for {
		event, err := stream.Recv()
		if err != nil {
			log.Println("error on receive: " + err.Error())
			continue
		}

		if event.Event == nil {
			continue
		}
		kprobeEvent, ok := event.Event.(*tetragon.GetEventsResponse_ProcessKprobe)
		if !ok {
			continue
		}

		switch kprobeEvent.ProcessKprobe.FunctionName {
		case "security_file_open":
			err = pushUnauthorizedActionAttempt(ctx, metricsClient, instanceId, instanceName, instanceZone)
			if err != nil {
				log.Println("failed to push metric: " + err.Error())
			}
		default:
			continue
		}
	}
}
 
The above code does the following:
  • initializes the GCP metrics client that will be used to push our custom metrics
  • initializes a gRPC client to listen for Tetragon events
  • continuously polls for events, and whenever one matches the `security_file_open` call from our Tetragon policy, pushes a custom metric
The function for pushing the custom metric can be found below:
 
// pushUnauthorizedActionAttempt writes a single data point of our custom metric,
// attached to the gce_instance monitored resource that produced the event.
// GCPProjectID is assumed to be a package-level variable read at startup.
func pushUnauthorizedActionAttempt(ctx context.Context, client *monitoring.MetricClient, instanceId, instanceName, instanceZone string) error {
	now := time.Now()

	req := &monitoringpb.CreateTimeSeriesRequest{
		Name: "projects/" + GCPProjectID,
		TimeSeries: []*monitoringpb.TimeSeries{
			{
				Metric: &metricpb.Metric{
					Type: "custom.googleapis.com/tetragon/unauthorized_action_attempt",
					Labels: map[string]string{
						"source":        "tetragon-listener",
						"instance_name": instanceName,
					},
				},
				Resource: &monitoredres.MonitoredResource{
					Type: "gce_instance",
					Labels: map[string]string{
						"project_id":  GCPProjectID,
						"instance_id": instanceId,
						"zone":        instanceZone,
					},
				},
				Points: []*monitoringpb.Point{
					{
						Interval: &monitoringpb.TimeInterval{
							EndTime: timestamppb.New(now),
						},
						Value: &monitoringpb.TypedValue{
							Value: &monitoringpb.TypedValue_Int64Value{
								Int64Value: 1,
							},
						},
					},
				},
			},
		},
	}

	if err := client.CreateTimeSeries(ctx, req); err != nil {
		return fmt.Errorf("failed to write metric: %w", err)
	}
	return nil
}

The main things to note here are:
  • the `type` for a custom metric must always start with `custom.googleapis.com`
  • when integrating with an existing type of monitored resource we need to provide the exact resource labels it requires (in our case for the `gce_instance` resource, these are `project_id`, `instance_id` and `zone`)
  • any additional metadata/information that we want to attach to the metric, can be passed via the metric’s labels
 
We can then create a systemd service to execute our `tetragon-listener` application:
 
[Unit]
Description=Tetragon Listener Service
After=tetragon.service

[Service]
ExecStart=/bin/tetragon-listener
Restart=always
User=root

[Install]
WantedBy=multi-user.target
 
With this in place, the `custom.googleapis.com/tetragon/unauthorized_action_attempt` metric that we’ve defined will be pushed anytime the Tetragon event is detected.
 
So we can finally configure our alerting policy to detect any Compute Engine instance that tries to access that file.
 
resource "google_monitoring_alert_policy" "unauthorized_access_alert" {
  project      = var.project_id
  display_name = "Unauthorized Access Alert Policy"
  combiner     = "OR"

  conditions {
    display_name = "unauthorized access attempt detected"

    condition_prometheus_query_language {
      query               = <<-EOT
        (sum_over_time(custom_googleapis_com:tetragon_unauthorized_action_attempt{monitored_resource="gce_instance"}[1m]) > 0)
        and on(instance_id)
        (sum_over_time(compute_googleapis_com:instance_uptime{monitored_resource="gce_instance"}[1m]) > 0)
      EOT
      evaluation_interval = "30s"
    }
  }

  alert_strategy {
    auto_close = "180s"
  }

  notification_channels = [
    google_monitoring_notification_channel.alert_pubsub_topic.id
  ]
}
 

🚨 The Alert Handler CloudRun Service


Now that we have our infrastructure ready to go, as well as some alerting policies configured, let’s have a look at the code that our CloudRun service will be executing.
 
First, it’ll be listening for incoming HTTP requests:

func main() {
	http.HandleFunc("/", handler)
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}
 
Since the requests are triggered by our Pub/Sub push subscription, the request body will always contain a `data` and an `attributes` field. We’ll be able to unmarshal it into the following type (note that `encoding/json` automatically base64-decodes the `data` string into the `[]byte` field):
 
type PubSubMessage struct {
	Message struct {
		Data       []byte            `json:"data"`
		Attributes map[string]string `json:"attributes"`
	} `json:"message"`
}
 
And the `Data` itself will contain the alert information, which we can unmarshal into:
 
type AlertPayload struct {
	Incident struct {
		AlertingPolicyName string `json:"policy_name"`

		Metric struct {
			Labels map[string]string `json:"labels"`
		} `json:"metric"`

		Resource struct {
			Labels map[string]string `json:"labels"`
			Type   string            `json:"type"`
		} `json:"resource"`
	} `json:"incident"`
}

Then, in our request handler, we will:
  • read the body of the request
  • unmarshal it into our `AlertPayload` type
  • log which alerting policy was triggered, and which Compute Engine instance caused the alert
  • and finally, delete the offending instance
 
import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strings"

	"google.golang.org/api/compute/v1"
	"google.golang.org/api/option"
)

func handler(w http.ResponseWriter, r *http.Request) {
	projectId := os.Getenv("GCP_PROJECT_ID")
	ctx := context.Background()

	computeService, err := compute.NewService(ctx, option.WithScopes(compute.ComputeScope))
	if err != nil {
		log.Printf("failed to create compute service: %v", err)
		http.Error(w, "failed to create compute service", http.StatusInternalServerError)
		return
	}

	body, err := io.ReadAll(r.Body)
	if err != nil {
		log.Printf("failed to read body: %v", err)
		http.Error(w, "failed to read body", http.StatusBadRequest)
		return
	}
	defer r.Body.Close()

	var message PubSubMessage
	if err := json.Unmarshal(body, &message); err != nil {
		log.Printf("failed to unmarshal pubsub message: %v", err)
		http.Error(w, "failed to unmarshal pubsub message", http.StatusBadRequest)
		return
	}

	var alertPayload AlertPayload
	if err := json.Unmarshal(message.Message.Data, &alertPayload); err != nil {
		log.Printf("failed to unmarshal incident payload: %v", err)
		http.Error(w, "failed to unmarshal incident payload", http.StatusBadRequest)
		return
	}

	alert := alertPayload.Incident
	log.Printf("'%s' triggered for '%s'(%s) in zone '%s'\n",
		alert.AlertingPolicyName,
		alert.Metric.Labels["instance_name"],
		alert.Resource.Labels["instance_id"],
		alert.Resource.Labels["zone"],
	)

	_, err = computeService.Instances.Delete(projectId, alert.Resource.Labels["zone"], alert.Metric.Labels["instance_name"]).Do()
	if err != nil && !strings.Contains(err.Error(), "not found") {
		log.Printf("failed to delete instance: %s", err.Error())
		http.Error(w, "failed to delete instance", http.StatusInternalServerError)
		return
	}

	fmt.Fprint(w, "OK")
}
 
Note that any **2XX** response returned by our CloudRun will automatically be treated as acknowledging the Pub/Sub message.
 
And also note that we can easily have different reactive measures for different policies by filtering on `alert.AlertingPolicyName`.
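One way to sketch that routing is a simple lookup table from policy name to reaction. The policy names below match the `display_name`s defined earlier; the string-returning actions are hypothetical stand-ins for real calls (instance deletion, recovery, and so on):

```go
package main

import "fmt"

// reactions maps an alerting-policy display name to the action taken for it.
// The actions here just describe what would happen; in the real handler they
// would call the Compute Engine API instead of returning a string.
var reactions = map[string]func(instance string) string{
	"Unauthorized Access Alert Policy": func(i string) string { return "delete " + i },
	"Too much CPU usage Alert Policy":  func(i string) string { return "recover " + i },
}

// react looks up the reaction for a policy, falling back to logging only.
func react(policyName, instance string) string {
	if action, ok := reactions[policyName]; ok {
		return action(instance)
	}
	return "log only: " + policyName
}

func main() {
	fmt.Println(react("Unauthorized Access Alert Policy", "vm-42")) // delete vm-42
	fmt.Println(react("Some Other Policy", "vm-42"))                // log only: Some Other Policy
}
```

A map keeps the handler flat as policies accumulate, instead of growing a long if/else chain.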
 
Finally, we can build and release a container with this application using GoReleaser, so that it can be used by our CloudRun service:
version: 2

builds:
- id: monitorvm
  main: ./cloudrun/main.go
  binary: monitorvm
  env:
    - CGO_ENABLED=0
  goos:
    - linux
  goarch:
    - amd64

dockers:
- goos: linux
  goarch: amd64
  dockerfile: ./cloudrun/Dockerfile
  ids:
    - monitorvm
  image_templates:
    - "{{ .Env.DOCKER_REPOSITORY }}:{{ .Version }}"
  skip_push: true
  build_flag_templates:
    - "--label=org.opencontainers.image.created={{ .Date }}"
    - "--label=org.opencontainers.image.title={{ .ProjectName }}"
    - "--label=org.opencontainers.image.revision={{ .FullCommit }}"
    - "--label=org.opencontainers.image.version={{ .Version }}"
    - "--platform=linux/amd64"
 

🔚 Conclusion


When it comes to metrics-based alerts in GCP, the sky is the limit. Paired with the infrastructure presented in this article, they enable reactive monitoring that is simple to implement and scales to hundreds of thousands of monitored instances.