feat(metrics): expose Prometheus /metrics endpoint #37

Open
edilsonoliveirama wants to merge 4 commits into EvolutionAPI:main from edilsonoliveirama:feat/prometheus-metrics

@edilsonoliveirama edilsonoliveirama commented Apr 20, 2026

What was added

New GET /metrics endpoint that exposes metrics in the standard Prometheus text format. It requires no authentication, following the Prometheus convention of protecting the endpoint at the network/ingress layer.

Metrics exposed

| Metric | Type | Description |
|---|---|---|
| evolution_instances_total | gauge | Total registered instances |
| evolution_instances_connected | gauge | Instances connected to WhatsApp |
| evolution_instances_disconnected | gauge | Disconnected instances |
| evolution_http_requests_total | counter | HTTP requests by method/path/status |
| evolution_http_request_duration_seconds | histogram | HTTP latency by method/path |
| evolution_build_info | gauge | Always 1; the version label carries the version |
| evolution_uptime_seconds | gauge | Seconds since server start |

Technical details

  • Instance metrics use a custom Collector that queries the database on every scrape, so values are always current and no event hooks are needed
  • HTTP path labels use the route pattern registered in Gin (e.g. /instance/:instanceId), which keeps cardinality bounded
  • Isolated registry (the global Prometheus registry is not used), which avoids conflicts with other libraries
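
The cardinality point above can be illustrated with a toy counter. This is only a sketch: `seriesKey`, `requestCounter`, `sampleRequests` and `countSeries` are hypothetical names, and the map stands in for a labelled counter rather than using the Prometheus client. In Gin the matched pattern is available from `c.FullPath()`; whether the PR reads it that way is not shown in this conversation.

```go
package main

import "fmt"

// seriesKey mirrors the label set described above: method, path, status.
type seriesKey struct {
	Method string
	Path   string
	Status int
}

// requestCounter is a toy stand-in for a labelled counter. It is not the
// Prometheus client, just enough to count distinct label combinations.
type requestCounter map[seriesKey]int

func (rc requestCounter) inc(method, path string, status int) {
	rc[seriesKey{method, path, status}]++
}

// request pairs the raw URL path with the route pattern that matched it.
type request struct {
	RawPath string
	Pattern string
}

// sampleRequests builds n requests that all hit the same route with
// different instance IDs.
func sampleRequests(n int) []request {
	reqs := make([]request, 0, n)
	for i := 0; i < n; i++ {
		reqs = append(reqs, request{
			RawPath: fmt.Sprintf("/instance/%d", i),
			Pattern: "/instance/:instanceId",
		})
	}
	return reqs
}

// countSeries reports how many distinct series the requests create when
// labelled by raw path (usePattern=false) or by route pattern (true).
func countSeries(reqs []request, usePattern bool) int {
	rc := requestCounter{}
	for _, r := range reqs {
		path := r.RawPath
		if usePattern {
			path = r.Pattern
		}
		rc.inc("GET", path, 200)
	}
	return len(rc)
}

func main() {
	reqs := sampleRequests(1000)
	fmt.Println("series with raw paths:    ", countSeries(reqs, false))
	fmt.Println("series with route pattern:", countSeries(reqs, true))
}
```

Labelling by raw path creates one series per instance ID, while labelling by pattern keeps a single series per route, method and status.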

New dependency

github.com/prometheus/client_golang v1.20.5

Example usage with Grafana

Configure Prometheus to scrape http://<host>:<port>/metrics, add that Prometheus server as a data source in Grafana, then import a standard Go dashboard or build panels from the evolution_* metrics.

Summary by Sourcery

Expose a Prometheus-compatible /metrics endpoint and add a lightweight HTML dashboard for monitoring instance and server health, while extending chat mute functionality to support configurable durations and tightening related APIs.

New Features:

  • Add a Prometheus /metrics HTTP endpoint with isolated registry and Gin middleware to expose application, HTTP, and instance metrics.
  • Introduce a standalone manager dashboard page that visualizes instance status and server health using existing API endpoints.
  • Allow chat mute operations to specify a mute duration in seconds, including support for permanent mutes via duration 0.

Bug Fixes:

  • Remove obsolete TODO comments indicating chat pin/archive/mute endpoints were non-functional, aligning annotations with current behavior.

Enhancements:

  • Extend the instance repository with a method to retrieve all instances for use by metrics collectors.
  • Clarify the chat mute API documentation to describe duration semantics and example values.
  • Adjust manager routes to differentiate the new dashboard entry point from the existing React bundle routing.

Build:

  • Add Prometheus client libraries as direct and indirect Go module dependencies and ensure the dashboard assets are built into the Docker image.

edilsonoliveirama and others added 4 commits April 20, 2026 13:28
The /manager dashboard previously showed only a static placeholder
("Dashboard content will be implemented here..."). This replaces it
with a standalone HTML page that fetches live data from the API and
displays real metrics:

- Total instances count
- Connected instances count and percentage
- Disconnected instances count
- Server health status (GET /server/ok)
- AlwaysOnline count
- Instance table with name, status badge, phone number, client and
  AlwaysOnline indicator
- Auto-refresh every 30 seconds with manual refresh button

Implementation uses a standalone HTML file (Tailwind CDN + vanilla JS
fetch) served at GET /manager, keeping the existing compiled bundle
intact for all other routes (/manager/instances, /manager/login, etc.).

Changes:
- manager/dashboard/index.html: new self-contained dashboard page
- pkg/routes/routes.go: serve dashboard/index.html for GET /manager
  (exact), keep dist/index.html for GET /manager/*any (wildcard)
- Dockerfile: copy manager/dashboard/ into the final image
- .gitignore: exclude manager build artifacts from version control

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Removes the '// TODO: not working' markers from the six chat endpoints
(pin, unpin, archive, unarchive, mute, unmute). Investigation confirmed
the implementation is correct: the endpoints work on fully-established
sessions that have synced WhatsApp app state keys. The markers were
likely added after testing on a fresh session where keys had not yet
been distributed by the WhatsApp server.

Also fixes the hardcoded 1-hour mute duration: the BodyStruct now
accepts an optional `duration` field (seconds). Sending 0 or omitting
the field mutes the chat indefinitely, matching WhatsApp's own behaviour.
Reject negative duration values with a 400-level validation error.
Document that duration=0 maps to 'mute forever' (BuildMute treats 0
as a zero time.Duration, which causes BuildMuteAbs to set the
WhatsApp sentinel timestamp of -1).
Clamp duration to a maximum of 1 year (31536000 seconds) to avoid
unreasonably large timestamps being sent to the WhatsApp API.

Adds GET /metrics serving standard Prometheus text format.
No authentication required — follows the Prometheus convention of
protecting the endpoint at the network/ingress level.

Metrics exposed:

  evolution_instances_total               total registered instances (gauge)
  evolution_instances_connected           connected instances (gauge)
  evolution_instances_disconnected        disconnected instances (gauge)
  evolution_http_requests_total           HTTP requests by method/path/status (counter)
  evolution_http_request_duration_seconds HTTP latency by method/path (histogram)
  evolution_build_info                    always 1, version label carries the value (gauge)
  evolution_uptime_seconds                seconds since server start (gauge)

Instance gauges use a custom Collector that queries the database on
each scrape, so values are always current without event hooks.
HTTP path labels use Gin registered route patterns (e.g. /instance/:instanceId)
to keep cardinality bounded regardless of distinct IDs in the path.

New dependency: github.com/prometheus/client_golang v1.20.5

sourcery-ai Bot commented Apr 20, 2026

Reviewer's Guide

Adds a Prometheus-based metrics subsystem with an unauthenticated GET /metrics endpoint, integrates HTTP metrics middleware, exposes instance-level gauges via a custom collector, introduces a new HTML dashboard for real-time instance status, enhances chat mute behavior with configurable duration and validation, and slightly adjusts routing and dependencies to support these features.

Sequence diagram for Prometheus scraping the new /metrics endpoint

sequenceDiagram
    participant Prometheus as Prometheus
    participant GinEngine as GinEngine
    participant MetricsRegistry as MetricsRegistry
    participant PrometheusRegistry as PrometheusRegistry
    participant InstanceCollector as instanceCollector
    participant InstanceRepository as InstanceRepository
    participant Database as Database

    Prometheus->>GinEngine: GET /metrics
    GinEngine->>MetricsRegistry: Handler()
    GinEngine->>PrometheusRegistry: ServeHTTP(response, request)

    Note over PrometheusRegistry,InstanceCollector: PrometheusRegistry gathers all registered metrics

    PrometheusRegistry->>InstanceCollector: Collect(ch)
    InstanceCollector->>InstanceRepository: GetAllInstances()
    InstanceRepository->>Database: SELECT * FROM instances
    Database-->>InstanceRepository: instances rows
    InstanceRepository-->>InstanceCollector: []*Instance

    InstanceCollector-->>PrometheusRegistry: Gauge metrics (total, connected, disconnected)

    PrometheusRegistry-->>GinEngine: Text exposition format
    GinEngine-->>Prometheus: 200 OK

Sequence diagram for HTTP request metrics via Gin middleware

sequenceDiagram
    participant Client as HttpClient
    participant GinEngine as GinEngine
    participant GinContext as GinContext
    participant MetricsRegistry as MetricsRegistry
    participant Handler as RouteHandler

    Client->>GinEngine: HTTP request
    GinEngine->>GinContext: Create context
    GinEngine->>MetricsRegistry: GinMiddleware()
    MetricsRegistry->>GinContext: Wrap handler with timing

    GinContext->>Handler: Invoke route handler
    Handler-->>GinContext: Write response

    GinContext-->>MetricsRegistry: Status, method, path, duration
    MetricsRegistry-->>MetricsRegistry: httpRequests.WithLabelValues(...).Inc()
    MetricsRegistry-->>MetricsRegistry: httpDuration.WithLabelValues(...).Observe()

    GinEngine-->>Client: HTTP response
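
The flow in this diagram can be approximated with the standard library alone. The sketch below is a net/http analogue, not the PR's Gin middleware: `statusRecorder`, `observation`, `withMetrics` and `runOnce` are hypothetical names, and the `record` callback stands in for the counter increment and histogram observation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// observation is what the middleware reports after each request.
type observation struct {
	Method   string
	Path     string
	Status   int
	Duration time.Duration
}

// withMetrics wraps next, times it, and hands the result to record.
func withMetrics(next http.Handler, record func(observation)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		// Default to 200 because handlers may never call WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		record(observation{
			Method:   req.Method,
			Path:     req.URL.Path, // a real setup would label by route pattern
			Status:   rec.status,
			Duration: time.Since(start),
		})
	})
}

// runOnce sends one request through the middleware and returns what it recorded.
func runOnce(method, path string, status int) observation {
	var got observation
	h := withMetrics(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}), func(o observation) { got = o })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(method, path, nil))
	return got
}

func main() {
	o := runOnce(http.MethodGet, "/server/ok", http.StatusNoContent)
	fmt.Printf("%s %s -> %d in %s\n", o.Method, o.Path, o.Status, o.Duration)
}
```

The recording step runs after the handler returns, which is why the status code and duration are both known at that point.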

Updated class diagram for metrics registry, instance collector, and chat mute API

classDiagram
    class Registry {
        -prometheus.Registry reg
        -prometheus.CounterVec httpRequests
        -prometheus.HistogramVec httpDuration
        +New(version string, instanceRepo InstanceRepository) Registry
        +Handler() http.Handler
        +GinMiddleware() gin.HandlerFunc
    }

    class instanceCollector {
        -InstanceRepository repo
        -*prometheus.Desc descTotal
        -*prometheus.Desc descConnected
        -*prometheus.Desc descDisconnected
        +Describe(ch chan<- *prometheus.Desc)
        +Collect(ch chan<- prometheus.Metric)
        +newInstanceCollector(repo InstanceRepository) prometheus.Collector
    }

    class InstanceRepository {
        <<interface>>
        +GetAllInstances() ([]*Instance, error)
        +GetAllConnectedInstances() ([]*Instance, error)
        +GetAllConnectedInstancesByClientName(clientName string) ([]*Instance, error)
        +GetAll(clientName string) ([]*Instance, error)
        +Delete(instanceId string) error
        +GetAdvancedSettings(instanceId string) (*AdvancedSettings, error)
        +UpdateAdvancedSettings(instanceId string, settings *AdvancedSettings) error
    }

    class BodyStruct {
        +string Chat
        +int64 Duration
    }

    class chatService {
        +ChatMute(data *BodyStruct, instance *Instance) (string, error)
        +ChatUnmute(data *BodyStruct, instance *Instance) (string, error)
    }

    class appstate {
        +BuildMute(recipient JID, mute bool, duration time.Duration) AppState
    }

    class ChatHandler {
        +ChatMute(ctx *gin.Context)
    }

    class MaxMuteNote {
        <<note>>
        Constant: maxMuteDurationSeconds = 365 * 24 * 3600 (1 year cap)
    }

    Registry --> instanceCollector : registers
    Registry --> InstanceRepository : uses
    instanceCollector --> InstanceRepository : queries
    chatService --> BodyStruct : consumes
    chatService --> appstate : calls BuildMute
    ChatHandler --> chatService : calls ChatMute
    chatService .. MaxMuteNote

Flow diagram for the metrics dashboard data loading

flowchart TD
    User["User opens /manager"] --> Browser["Browser loads dashboard index.html"]
    Browser --> LoadDataFunc["loadData() JS function"]
    LoadDataFunc --> FetchInstances["fetch /instance/all (with apikey)"]
    LoadDataFunc --> FetchServerOk["fetch /server/ok (with apikey)"]

    FetchInstances -->|success| UpdateInstanceMetrics["Update cards: total, connected, disconnected, AlwaysOnline"]
    FetchInstances -->|success| RenderTable["Render instances table"]
    FetchInstances -->|error| ShowInstanceError["Show error and hint about API key"]

    FetchServerOk -->|status ok| UpdateServerOnline["Show server Online, green icon"]
    FetchServerOk -->|error or !ok| UpdateServerError["Show server error status"]

    UpdateInstanceMetrics --> Done["Dashboard visible"]
    RenderTable --> Done
    ShowInstanceError --> Done
    UpdateServerOnline --> Done
    UpdateServerError --> Done

    Done --> Interval["setInterval(loadData, 30000)"]
    User --> RefreshButton["Click Atualizar"]
    RefreshButton --> LoadDataFunc
    Interval --> LoadDataFunc

File-Level Changes

Change: Introduce Prometheus metrics registry, HTTP instrumentation middleware, and /metrics endpoint backed by a custom instance collector.
Details:
  • Create a dedicated metrics Registry that defines and registers HTTP request, latency, build info, uptime, and instance gauges using a local Prometheus registry.
  • Implement a custom Collector that queries the instance repository on each scrape to populate evolution_instances_total/connected/disconnected gauges.
  • Wire the metrics registry into the Gin router, attaching the metrics middleware globally and exposing an unauthenticated GET /metrics endpoint.
  • Extend the instance repository with a GetAllInstances method used by the instance metrics collector.
Files:
  • pkg/metrics/metrics.go
  • cmd/evolution-go/main.go
  • pkg/instance/repository/instance_repository.go

Change: Add a standalone HTML manager dashboard that consumes existing APIs to display instance and server status, and adjust manager routing and Docker image layout to serve it.
Details:
  • Add manager/dashboard/index.html implementing a Tailwind-based dashboard that fetches /instance/all and /server/ok using the stored API key and auto-refreshes every 30 seconds.
  • Change Gin routes so /manager serves the new dashboard and /manager/*any serves the original SPA bundle, preserving client-side routing for the existing manager app.
  • Update Dockerfile to copy the new dashboard directory into the runtime image.
Files:
  • manager/dashboard/index.html
  • pkg/routes/routes.go
  • Dockerfile

Change: Enhance chat mute functionality to accept a duration parameter with bounds checking and clarify API docs.
Details:
  • Extend the chat request body struct with an optional integer Duration field representing mute duration in seconds, used by mute operations.
  • Add validation in ChatMute to reject negative durations and cap maximum duration at one year in seconds.
  • Update ChatMute to pass the requested duration (including 0 for mute forever) to appstate.BuildMute instead of a fixed 1-hour mute.
  • Improve the Swagger description of the mute endpoint to document how to use the duration field.
Files:
  • pkg/chat/service/chat_service.go
  • pkg/chat/handler/chat_handler.go

Change: Minor API and dependency cleanups to support new behavior.
Details:
  • Remove outdated TODO comments on chat pin/archive/mute routes, leaving behavior unchanged but reflecting current support status.
  • Add prometheus/client_golang and its transitive dependencies to go.mod and go.sum to support metrics collection.
Files:
  • pkg/routes/routes.go
  • go.mod
  • go.sum

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues

Comment thread pkg/metrics/metrics.go
Comment on lines +137 to +140
func (c *instanceCollector) Collect(ch chan<- prometheus.Metric) {
	instances, err := c.repo.GetAllInstances()
	if err != nil {
		// Emit nothing on error rather than stale data.

suggestion (bug_risk): Silently returning on repository errors hides scrape issues and makes diagnostics harder.

If GetAllInstances fails, the scrape appears successful but all evolution_instances_* series vanish, making failures hard to notice or alert on. Please surface this error—e.g., via an explicit health/error metric (like a gauge evolution_instance_metrics_up set to 0 on error, 1 on success), prometheus.NewInvalidMetric, and/or logging the error—to improve observability and debugging.

Suggested implementation:

func (c *instanceCollector) Collect(ch chan<- prometheus.Metric) {
	instances, err := c.repo.GetAllInstances()
	if err != nil {
		// Surface repository errors as an invalid metric so scrape issues are visible.
		ch <- prometheus.NewInvalidMetric(c.descTotal, err)
		return
	}

If you prefer a dedicated health gauge instead of (or in addition to) NewInvalidMetric, you'll need to:

  1. Add a new descriptor to instanceCollector (e.g., descUp *prometheus.Desc) and initialize it where the collector is constructed, with a name like evolution_instance_metrics_up.
  2. Emit that metric in Collect, setting it to 0 on error and 1 on success, and include it in Describe.
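
The health-gauge variant described in the two steps above can be sketched without the Prometheus client. Everything here is illustrative: `Instance`, `sample`, `collect` and `errUnavailable` are hypothetical, the `Connected` field is an assumed way of telling connected instances apart, and a real collector would send `prometheus.Metric` values on the channel instead of returning a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a hypothetical, trimmed-down stand-in for the repository model.
type Instance struct {
	Name      string
	Connected bool // assumption: the real model may track status differently
}

// sample is one name/value pair as it would appear in a scrape.
type sample struct {
	Name  string
	Value float64
}

// errUnavailable simulates a repository failure.
var errUnavailable = errors.New("db unavailable")

// collect mirrors the shape of Collect: fetch once, then emit samples.
// On error it emits only the health gauge, set to 0, so the failure is
// visible in the scrape instead of the series silently disappearing.
func collect(fetch func() ([]Instance, error)) []sample {
	instances, err := fetch()
	if err != nil {
		return []sample{{"evolution_instance_metrics_up", 0}}
	}
	connected := 0
	for _, in := range instances {
		if in.Connected {
			connected++
		}
	}
	total := len(instances)
	return []sample{
		{"evolution_instance_metrics_up", 1},
		{"evolution_instances_total", float64(total)},
		{"evolution_instances_connected", float64(connected)},
		{"evolution_instances_disconnected", float64(total - connected)},
	}
}

func main() {
	ok := func() ([]Instance, error) {
		return []Instance{{"a", true}, {"b", false}, {"c", true}}, nil
	}
	failing := func() ([]Instance, error) { return nil, errUnavailable }

	fmt.Println(collect(ok))
	fmt.Println(collect(failing))
}
```

With this shape an alert on `evolution_instance_metrics_up == 0` distinguishes a failing repository from a deployment that simply has no instances.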

Comment thread pkg/metrics/metrics.go
Comment on lines +52 to +58
	reg.MustRegister(
		httpRequests,
		httpDuration,
		buildInfo,
		uptimeGauge,
		newInstanceCollector(instanceRepo),
	)

suggestion: The custom Prometheus registry omits Go and process collectors, reducing observability of runtime/resource behavior.

Because this uses a standalone prometheus.Registry, it won’t include the default go_* or process_* metrics from the global registry. To retain standard CPU/memory/goroutine and process visibility, also register prometheus.NewGoCollector() and prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}) on reg alongside your custom metrics.

Suggested change
-	reg.MustRegister(
-		httpRequests,
-		httpDuration,
-		buildInfo,
-		uptimeGauge,
-		newInstanceCollector(instanceRepo),
-	)
+	reg.MustRegister(
+		httpRequests,
+		httpDuration,
+		buildInfo,
+		uptimeGauge,
+		newInstanceCollector(instanceRepo),
+		prometheus.NewGoCollector(),
+		prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}),
+	)

@paluan-batista paluan-batista left a comment

Hi, good morning, my friend. How are you?

To help with a review of your PR, I've left a few specific comments as suggestions.

Comment thread cmd/evolution-go/main.go
r := gin.Default()

metricsRegistry := metrics.New(version, instanceRepository)
r.Use(metricsRegistry.GinMiddleware())

Suggestion: Move this line to after the CORS configuration. That prevents the metrics collector from processing and recording requests that would be blocked immediately by security policies.

Comment thread pkg/metrics/metrics.go
}

func (c *instanceCollector) Collect(ch chan<- prometheus.Metric) {
	instances, err := c.repo.GetAllInstances()

Suggestion: This query runs on every Prometheus scrape. With thousands of instances, it will cause load spikes on the database. Consider using a cache or incremental metrics instead of querying the database in real time.
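
One way to read the caching suggestion is a small time-to-live wrapper around the query, sketched below with the standard library only. The names `counts`, `cachedCounts` and `newCachedCounts` are hypothetical, and the 15-second TTL in `main` is an arbitrary illustration rather than a value taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// counts holds the aggregated values the instance gauges need.
type counts struct {
	Total     int
	Connected int
}

// cachedCounts memoises load for ttl so that frequent scrapes do not
// each hit the database.
type cachedCounts struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock, handy for tests
	load    func() (counts, error)
	value   counts
	expires time.Time
}

func newCachedCounts(ttl time.Duration, load func() (counts, error)) *cachedCounts {
	return &cachedCounts{ttl: ttl, now: time.Now, load: load}
}

// Get returns the cached value while it is fresh and reloads otherwise.
// A failed reload is returned to the caller and does not refresh the cache.
func (c *cachedCounts) Get() (counts, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.now().Before(c.expires) {
		return c.value, nil
	}
	v, err := c.load()
	if err != nil {
		return counts{}, err
	}
	c.value, c.expires = v, c.now().Add(c.ttl)
	return v, nil
}

func main() {
	queries := 0
	cache := newCachedCounts(15*time.Second, func() (counts, error) {
		queries++ // stands in for the database query
		return counts{Total: 3, Connected: 2}, nil
	})
	for i := 0; i < 5; i++ {
		cache.Get()
	}
	fmt.Println("scrapes: 5, database queries:", queries)
}
```

The trade-off is that values may be up to one TTL old, which gives up some of the "always current" property described in the PR in exchange for a bounded query rate.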

if data.Duration < 0 {
	return "", errors.New("duration must be >= 0 (0 = mute forever)")
}
if data.Duration > maxMuteDurationSeconds {

Suggestion: The maxMuteDurationSeconds check is correct. Just make sure the returned error is clear (e.g. "mute duration exceeds 1 year limit") so the user knows why the action failed.
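
A minimal sketch of the bounds check with explicit error values follows. It assumes out-of-range durations are rejected rather than clamped, which this conversation does not settle: the commit message says "clamp" while the excerpt above only shows the comparison. The function `muteDuration` and both error variables are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// maxMuteDurationSeconds is the one-year cap quoted in the PR (365 * 24 * 3600).
const maxMuteDurationSeconds int64 = 365 * 24 * 3600

var (
	errNegativeDuration = errors.New("duration must be >= 0 (0 = mute forever)")
	errDurationTooLong  = errors.New("mute duration exceeds 1 year limit")
)

// muteDuration validates a duration in seconds and converts it to a
// time.Duration. Zero is passed through unchanged and means "mute forever".
// Assumption: values above the cap are rejected, not clamped.
func muteDuration(seconds int64) (time.Duration, error) {
	if seconds < 0 {
		return 0, errNegativeDuration
	}
	if seconds > maxMuteDurationSeconds {
		return 0, errDurationTooLong
	}
	return time.Duration(seconds) * time.Second, nil
}

func main() {
	for _, s := range []int64{-1, 0, 3600, maxMuteDurationSeconds + 1} {
		d, err := muteDuration(s)
		fmt.Printf("%d -> %v, %v\n", s, d, err)
	}
}
```

Returning distinct error values lets the handler map each case to a specific 400-level message instead of a generic validation failure.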
