The OpenTelemetry Collector can help us get detailed database metrics by querying database system tables and views, parsing the results, and creating metrics and logs for an observability platform, e.g. Prometheus or Mimir for metrics and Loki for logs. A basic requirement could be to see the top 10 queries consuming most of your resources.

Strategy

OpenTelemetry Configuration Strategy

MySQL and MS SQL assign a unique identifier to each query (a digest or query hash), so you can use this field (column) to correlate metrics with logs. This will improve data ingestion on the observability platform side. Otherwise, you could end up with metrics carrying huge label values (the full query text), which can hurt performance or indexing in Prometheus or Mimir.

To summarize: we will query a system table/view, calculate what we want in terms of metrics (e.g. CPU load or execution time), and create one metric with a label storing the digest or query hash; at the same time, we will create a log record storing the full query text and carrying the same digest or hash as a label.
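For illustration, the pair of signals produced for one query could look roughly like this (the numeric value, digest, and query text are made up; the metric and attribute names match the collector configuration shown later):

# metric stored in Prometheus/Mimir
slow_query{DIGEST="unique_value_123", dbinstance="pmp"} 12.345

# log record stored in Loki, carrying the same DIGEST attribute
body: SELECT * FROM orders WHERE customer_id = ?
attributes: DIGEST="unique_value_123"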

I believe this method will work for any of these databases: MySQL, MS SQL, and Oracle.

Calculate Top 10 Queries

Top 10 by Average Execution Time

Sort queries by average duration per execution (calculated as total_elapsed_time divided by execution_count).

When to use: To identify the slowest queries that might be causing performance bottlenecks.

Calculate the rate of increase of total elapsed time and of execution count over the window, then divide them:

average_duration = rate(total_elapsed_time[5m]) / rate(execution_count[5m])
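To select the actual top 10 in Prometheus or Mimir, wrap the ranking expression in topk(). A minimal sketch, assuming the counters are exported as total_elapsed_time and execution_count with a digest label; the same topk(10, ...) wrapper applies to the ranking expressions in the sections below:

topk(10, rate(total_elapsed_time[5m]) / rate(execution_count[5m]))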

Top 10 by Cumulative Execution Time

Sort queries by their total cumulative execution time (total_elapsed_time).

When to use: To find queries that have the biggest overall impact on system resources:

cumulative_time = rate(total_elapsed_time[5m])

Top 10 by Execution Frequency

Sort queries by the number of times they have been executed (execution_count).

When to use: To detect queries that cause load spikes due to high frequency.

execution_frequency = rate(execution_count[5m])

Combined Metrics

Use a weighted scoring method, for example multiplying the rate of cumulative execution time by the rate of execution count.

When to use: To balance between queries that are slow and those that run frequently, identifying those with the greatest combined impact.

combined_score = rate(total_elapsed_time[5m]) * rate(execution_count[5m])
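For intuition, a hypothetical comparison of two queries over the same window:

Query A: 2 s average duration, 1 execution/s       -> rate(total_elapsed_time) = 2, score = 2 * 1   = 2
Query B: 0.02 s average duration, 200 executions/s -> rate(total_elapsed_time) = 4, score = 4 * 200 = 800

By cumulative time alone, B is only twice as expensive as A, but the combined score weights its high execution frequency much more heavily.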

OpenTelemetry Configuration for TOP 10 Queries in MySQL

Use the sqlquery receiver to query the system table and calculate the metrics. Here is the receiver section of a sample config file for OtelCol:

sqlquery:
    driver: mysql
    datasource: "USER:PASS@tcp(DB.HOST.NAME:3306)/DATABASE"
    queries:
      - sql: "SELECT DIGEST, DIGEST_TEXT AS query, ROUND(SUM_TIMER_WAIT / 1e12, 3) AS myexectime FROM performance_schema.events_statements_summary_by_digest ORDER BY myexectime DESC LIMIT 10;"
        # tracking_start_value: "0"
        # tracking_column: myexectime
        metrics:
          - metric_name: slow_query
            value_type: double
            value_column: "myexectime"
            attribute_columns: ["DIGEST"]
            static_attributes:
              dbinstance: pmp
        logs:
          - body_column: query
            attribute_columns: ["DIGEST"]

After applying this config, for each MySQL query we will get a metric named slow_query{DIGEST="unique_value_123"} and one log record storing the query text and carrying the same label: DIGEST="unique_value_123". When you create a dashboard in Grafana, you will be able to correlate these two signals and get the full picture in one panel/dashboard, while preserving the performance of both Mimir and Loki.
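For example, a Grafana dashboard could use a Mimir panel to rank the queries and a Loki panel to show the matching query text. A rough sketch, assuming the DIGEST attribute is queryable as a label in both backends and using a made-up Loki stream selector ({job="mysql-otel"}); the exact label names depend on your exporter configuration:

# Mimir/Prometheus panel: top 10 queries by reported execution time
topk(10, slow_query{dbinstance="pmp"})

# Loki panel: full query text for the digest selected in the metrics panel
{job="mysql-otel"} | DIGEST = "unique_value_123"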

Here is the data pipeline section for OtelCol:

    metrics:
      receivers: [sqlquery]
      processors: [resourcedetection/system, transform, batch]
      exporters: [debug]

    logs:
      receivers: [sqlquery]
      processors: [batch]
      exporters: [debug]
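For completeness, these fragments would sit in a full collector configuration roughly like this. This is a sketch, not a tested file: the resourcedetection/system and transform processor settings are placeholders (the pipeline above references them without showing their configuration), and the debug exporter is kept from the example; in practice you would replace it with your Mimir (metrics) and Loki (logs) exporters.

receivers:
  sqlquery:
    # ... sqlquery receiver configuration from the previous section ...

processors:
  resourcedetection/system:
    detectors: [system]   # placeholder detector list
  transform: {}           # placeholder; add your own OTTL statements if needed
  batch: {}

exporters:
  debug: {}               # replace with your Mimir and Loki exporters

service:
  pipelines:
    metrics:
      receivers: [sqlquery]
      processors: [resourcedetection/system, transform, batch]
      exporters: [debug]
    logs:
      receivers: [sqlquery]
      processors: [batch]
      exporters: [debug]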