Flux GitOps in GitLab

GitOps is a mode of operation where you keep Kubernetes manifests in a Git repository as the source of truth and use a tool to sync them to your cluster. There are several tools you can use, e.g. ArgoCD or Flux. GitLab uses Flux by default. I believe it is more lightweight than ArgoCD and requires fewer resources in your cluster. This makes it more suitable for small clusters like k3s or MicroK8s. You may want to use it if you do not need a fancy UI. If you have a big team of developers and they do not want to invest any effort in understanding Kubernetes logic, use ArgoCD. ...
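A minimal sketch of what Flux syncs from the repo: a Kustomization resource pointing at a path in the Git repository. Names, namespace and path below are illustrative assumptions:

```yaml
# Sketch of a Flux Kustomization; name, namespace and path are assumptions
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m            # how often Flux reconciles the repo state
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./clusters/my-cluster
  prune: true              # remove cluster objects deleted from the repo
```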

October 12, 2025

GitLab Agent CI/CD

GitLab offers 2 different ways to manage and provision resources in your Kubernetes clusters. You can go the GitOps way running FluxCD, or you can connect your cluster to GitLab using gitlab-agent and use kubectl commands directly in your .gitlab-ci.yml. If you decide to use gitlab-agent, it will install an additional pod into your cluster using Helm to keep 2-way communication between the cluster and GitLab. In your GitLab repo go to Operate -> Kubernetes clusters and create the new cluster. Save the agent ID. ...
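Once the agent is connected, a CI job can select the agent's kubecontext and run kubectl directly. A minimal .gitlab-ci.yml sketch; the project path, agent name and image are assumptions:

```yaml
deploy:
  image:
    name: bitnami/kubectl:latest
    entrypoint: ['']
  script:
    # context name is <path-of-project-holding-agent-config>:<agent-name>
    - kubectl config use-context my-group/my-project:my-agent
    - kubectl get pods -n default
```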

October 10, 2025

Automated Test for OpenTelemetry Deployment

The biggest chalenge in building Observability platform for big company is how to make strict policy for matrics labeling and automate configuration of Otel Collectors config files. This could be resolved by using some fleet management solution. Until today, I was not able to find full working open source solution, so you may need to create your own. However you resolve the above issue, it is good practice to have automated check after your mass deployment to see what servers picked up new config and sending signals labeled according to your latest configuration. ...
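One way to sketch such a check, assuming each collector stamps its metrics with a config-version label: diff your inventory against the hosts a Prometheus/Mimir series query reports. The file names, hostnames and the otel_config_version label are illustrative assumptions:

```shell
# expected.txt: inventory of hosts that should run the new config (sample data)
cat > expected.txt <<'EOF'
hostA
hostB
hostC
EOF

# reporting.txt: hosts seen with the new label, e.g. extracted from
#   curl -s "$MIMIR_URL/prometheus/api/v1/series" \
#     --data-urlencode 'match[]=up{otel_config_version="v42"}'
cat > reporting.txt <<'EOF'
hostA
hostC
EOF

sort expected.txt -o expected.txt
sort reporting.txt -o reporting.txt
# lines only in expected.txt = hosts that did not pick up the new config
comm -23 expected.txt reporting.txt > missing.txt
cat missing.txt
```

In a real run, reporting.txt would be produced from the query output (e.g. with jq) instead of a heredoc.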

September 29, 2025

OpenTelemetry Grafana APM Stack

Application Performance Monitoring is the ultimate level of observability in your systems. It comes on top of infra, network and other types of monitoring, providing info about the health and performance of your applications or services. OpenTelemetry is a CNCF project providing a standard way to collect telemetry data. It supports metrics, traces, and logs with vendor-neutral APIs and SDKs. It goes together with the Grafana-supported stack to store and visualize signals collected from monitored systems. One part of the stack is the OpenTelemetry SDK/API and the OpenTelemetry Collector to collect and transfer signals from applications, DBs, servers and other components of your system. The other part is the Grafana stack to store, search and visualize signals: Mimir (Prometheus) to store metrics, Loki to store logs and Tempo to store traces. ...
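The wiring between the two parts can be sketched in a Collector config. Endpoints and exporter names below are assumptions and depend on your Collector and Loki versions (e.g. Loki 3.x accepts logs on its native OTLP endpoint):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push   # Mimir remote-write (assumed)
  otlphttp/loki:
    endpoint: http://loki:3100/otlp           # Loki native OTLP (assumed)
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```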

September 5, 2025

Provision Mimir Alert Using Curl

Alertmanager is part of Mimir. It will store rules and check them against the received metrics. When a rule is triggered it will send a notification to the defined notification channels. It provides an API so you can automate alert provisioning. You could keep alerts under source control and create them from CI/CD.

Alerts

Alerts are defined in YAML files. Here is a sample:

# alert.yaml
groups:
  - name: cpu_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: system_cpu_time_seconds > 100
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU time exceeded threshold"

Curl Mimir API

curl -X POST http://<MIMIR_URL>/api/v1/rules/cpu_alerts \
  -H "Content-Type: application/yaml" \
  --data-binary @alert.yaml

August 30, 2025

OpenTelemetry SNMP Monitoring

SNMP is used to monitor many different devices like routers, UPSs, storage systems, etc. The ultimate SRE strategy is to unify all monitoring systems into one observability platform and get a single place where you see all the info needed in case of emergency. Here is how you can use OpenTelemetry to collect SNMP data from your devices and send it to Prometheus or Mimir. Then you can use Grafana to visualize it and create alerts. ...
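As a sketch, the contrib Collector's snmp receiver can poll an OID and expose it as a metric. The endpoint, community string and OID below are placeholders, and the exact metric schema depends on your Collector version:

```yaml
receivers:
  snmp:
    collection_interval: 60s
    endpoint: udp://192.168.1.1:161   # device address (assumed)
    version: v2c
    community: public
    metrics:
      ifInOctets:
        unit: "By"
        sum:
          aggregation: cumulative
          monotonic: true
          value_type: int
        scalar_oids:
          - oid: "1.3.6.1.2.1.2.2.1.10.1"   # IF-MIB::ifInOctets.1
exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push   # assumed Mimir endpoint
service:
  pipelines:
    metrics:
      receivers: [snmp]
      exporters: [prometheusremotewrite]
```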

August 30, 2025

OpenTelemetry Database Monitoring

The OpenTelemetry Collector can help us get detailed database metrics by querying database system tables and views, parsing the results and creating metrics and logs for the observability platform, i.e. Prometheus or Mimir for metrics and Loki for logs. A basic requirement could be to see the top 10 queries consuming the most of your resources.

Strategy

OpenTelemetry Configuration Strategy

MySQL and MS SQL assign a unique label to each query (digest or query hash), so you can use this field (column) to correlate metrics to logs. This will improve your data ingestion on the observability platform side. Otherwise, you could end up with metrics with huge labels (full query text) and this can affect performance or indexing in Prometheus or Mimir. ...
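For MySQL, the digest-based approach could look like this sqlquery receiver sketch: the digest travels as a small metric attribute instead of the full query text. Credentials, host and the metric name are assumptions:

```yaml
receivers:
  sqlquery:
    driver: mysql
    datasource: "otel:secret@tcp(mysql-host:3306)/"   # assumed credentials
    collection_interval: 60s
    queries:
      - sql: >
          SELECT digest, schema_name, count_star, sum_timer_wait
          FROM performance_schema.events_statements_summary_by_digest
          ORDER BY sum_timer_wait DESC LIMIT 10
        metrics:
          - metric_name: db.query.timer_wait          # hypothetical name
            value_column: sum_timer_wait
            attribute_columns: [digest, schema_name]  # digest correlates to logs
```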

August 30, 2025

Automate Grafana Provisioning

In big deployments Grafana may have a big number of orgs, datasources and dashboards. If you need to automate provisioning of Grafana resources, you may try to create your own scripts and tools or try to find some open source solution to cover your use case. This comes down to 2 options: use the Grafana API or use file-based provisioning.

Typical Use Case

The typical use case could be: create an Azure AD group for users ...
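The file-based option means dropping YAML into Grafana's provisioning directory. A minimal datasource sketch; the uid, orgId and URL are assumptions:

```yaml
# /etc/grafana/provisioning/datasources/mimir.yaml
apiVersion: 1
datasources:
  - name: Mimir
    uid: mimir-ds            # fixed UID so dashboard links keep working
    type: prometheus
    access: proxy
    url: http://mimir:9009/prometheus
    orgId: 2                 # target org (assumed)
    isDefault: true
```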

August 29, 2025

Curl Grafana API

Useful curl commands to interact with the Grafana API. Create orgs, datasources, etc.

# Get all datasources
curl -X GET https://GRAFANA_URL/api/datasources \
  -H "Content-Type: application/json" -u 'user:pass' | jq

# Get all Organizations
curl -X GET https://GRAFANA_URL/api/orgs \
  -H "Content-Type: application/json" -u 'user:pass' | jq

# Create new Organization
curl -X POST -u 'USER:PASS' -H "Content-Type: application/json" \
  https://GRAFANA_URL/api/orgs -d '{"name": "New Organization"}'

August 9, 2025

Create Grafana Org With Perl

Intro

The Grafana API allows us to automate admin tasks. We have a big number of orgs mapped to Azure AD groups and we were looking for a way to automate provisioning. The process goes like this: create a new org, provision datasources (with the same UIDs), provision dashboards (with the same UIDs). In this way all dashboards work within the org, reusing all URLs, etc.

Simple Perl Script to Provision a Grafana ORG

#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::MaybeXS;
use MIME::Base64 qw(encode_base64);

my $user = $ENV{'USER'};
my $pass = $ENV{'PASS'};
my %url = (
    'dev'  => 'https://dev-grafana/api/orgs',
    'prod' => 'https://prod-grafana/api/orgs',
);

if (@ARGV != 2) {
    die "Usage: $0 <org-name> <dev|prod>\n";
}
my $org_name    = $ARGV[0];
my $environment = $ARGV[1];
die "Environment must be dev or prod\n" if $environment !~ /^(dev|prod)$/;

# Create ORG and get Org ID
my $data = { name => $org_name };
my $json = encode_json($data);
my $http = HTTP::Tiny->new;
my $credentials = encode_base64("$user:$pass", '');   # '' = no trailing newline

my $org_id;
my $headers = {
    "Content-Type"  => "application/json",
    "Authorization" => "Basic $credentials",
};
my $response = $http->request('POST', $url{$environment},
    { headers => $headers, content => $json });
if ($response->{success}) {
    my $resp_data = decode_json($response->{content});
    $org_id = $resp_data->{orgId};
} else {
    die "Failed to create org: $response->{status} $response->{reason}\n";
}
print "Created org ID: $org_id\n";

August 9, 2025