feat(metrics): expose Prometheus /metrics endpoint #37
edilsonoliveirama wants to merge 4 commits into EvolutionAPI:main
Conversation
The /manager dashboard previously showed only a static placeholder
("Dashboard content will be implemented here..."). This replaces it
with a standalone HTML page that fetches live data from the API and
displays real metrics:
- Total instances count
- Connected instances count and percentage
- Disconnected instances count
- Server health status (GET /server/ok)
- AlwaysOnline count
- Instance table with name, status badge, phone number, client and
AlwaysOnline indicator
- Auto-refresh every 30 seconds with manual refresh button
Implementation uses a standalone HTML file (Tailwind CDN + vanilla JS
fetch) served at GET /manager, keeping the existing compiled bundle
intact for all other routes (/manager/instances, /manager/login, etc.).
Changes:
- manager/dashboard/index.html: new self-contained dashboard page
- pkg/routes/routes.go: serve dashboard/index.html for GET /manager
(exact), keep dist/index.html for GET /manager/*any (wildcard)
- Dockerfile: copy manager/dashboard/ into the final image
- .gitignore: exclude manager build artifacts from version control
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the '// TODO: not working' markers from the six chat endpoints (pin, unpin, archive, unarchive, mute, unmute). Investigation confirmed the implementation is correct: the endpoints work on fully-established sessions that have synced WhatsApp app state keys. The markers were likely added after testing on a fresh session where the keys had not yet been distributed by the WhatsApp server.

Also fixes the hardcoded 1-hour mute duration: the BodyStruct now accepts an optional `duration` field (seconds). Sending 0 or omitting the field mutes the chat indefinitely, matching WhatsApp's own behaviour.
Reject negative duration values with a 400-level validation error. Document that duration=0 maps to 'mute forever' (BuildMute treats 0 as a zero time.Duration, which causes BuildMuteAbs to set the WhatsApp sentinel timestamp of -1). Clamp duration to a maximum of 1 year (31536000 seconds) to avoid unreasonably large timestamps being sent to the WhatsApp API.
Adds GET /metrics serving the standard Prometheus text format. No authentication required — follows the Prometheus convention of protecting the endpoint at the network/ingress level.

Metrics exposed:
- evolution_instances_total: total registered instances (gauge)
- evolution_instances_connected: connected instances (gauge)
- evolution_instances_disconnected: disconnected instances (gauge)
- evolution_http_requests_total: HTTP requests by method/path/status (counter)
- evolution_http_request_duration_seconds: HTTP latency by method/path (histogram)
- evolution_build_info: always 1; the version label carries the value (gauge)
- evolution_uptime_seconds: seconds since server start (gauge)

Instance gauges use a custom Collector that queries the database on each scrape, so values are always current without event hooks. HTTP path labels use Gin registered route patterns (e.g. /instance/:instanceId) to keep cardinality bounded regardless of distinct IDs in the path.

New dependency: github.com/prometheus/client_golang v1.20.5
Reviewer's Guide

Adds a Prometheus-based metrics subsystem with an unauthenticated GET /metrics endpoint, integrates HTTP metrics middleware, exposes instance-level gauges via a custom collector, introduces a new HTML dashboard for real-time instance status, enhances chat mute behavior with configurable duration and validation, and slightly adjusts routing and dependencies to support these features.

Sequence diagram for Prometheus scraping the new /metrics endpoint:

sequenceDiagram
participant Prometheus as Prometheus
participant GinEngine as GinEngine
participant MetricsRegistry as MetricsRegistry
participant PrometheusRegistry as PrometheusRegistry
participant InstanceCollector as instanceCollector
participant InstanceRepository as InstanceRepository
participant Database as Database
Prometheus->>GinEngine: GET /metrics
GinEngine->>MetricsRegistry: Handler()
GinEngine->>PrometheusRegistry: ServeHTTP(response, request)
Note over PrometheusRegistry,InstanceCollector: PrometheusRegistry gathers all registered metrics
PrometheusRegistry->>InstanceCollector: Collect(ch)
InstanceCollector->>InstanceRepository: GetAllInstances()
InstanceRepository->>Database: SELECT * FROM instances
Database-->>InstanceRepository: instances rows
InstanceRepository-->>InstanceCollector: []*Instance
InstanceCollector-->>PrometheusRegistry: Gauge metrics (total, connected, disconnected)
PrometheusRegistry-->>GinEngine: Text exposition format
GinEngine-->>Prometheus: 200 OK
Sequence diagram for HTTP request metrics via Gin middleware:

sequenceDiagram
participant Client as HttpClient
participant GinEngine as GinEngine
participant GinContext as GinContext
participant MetricsRegistry as MetricsRegistry
participant Handler as RouteHandler
Client->>GinEngine: HTTP request
GinEngine->>GinContext: Create context
GinEngine->>MetricsRegistry: GinMiddleware()
MetricsRegistry->>GinContext: Wrap handler with timing
GinContext->>Handler: Invoke route handler
Handler-->>GinContext: Write response
GinContext-->>MetricsRegistry: Status, method, path, duration
MetricsRegistry-->>MetricsRegistry: httpRequests.WithLabelValues(...).Inc()
MetricsRegistry-->>MetricsRegistry: httpDuration.WithLabelValues(...).Observe()
GinEngine-->>Client: HTTP response
Updated class diagram for metrics registry, instance collector, and chat mute API:

classDiagram
class Registry {
-prometheus.Registry reg
-prometheus.CounterVec httpRequests
-prometheus.HistogramVec httpDuration
+New(version string, instanceRepo InstanceRepository) Registry
+Handler() http.Handler
+GinMiddleware() gin.HandlerFunc
}
class instanceCollector {
-InstanceRepository repo
-*prometheus.Desc descTotal
-*prometheus.Desc descConnected
-*prometheus.Desc descDisconnected
+Describe(ch chan<- *prometheus.Desc)
+Collect(ch chan<- prometheus.Metric)
+newInstanceCollector(repo InstanceRepository) prometheus.Collector
}
class InstanceRepository {
<<interface>>
+GetAllInstances() ([]*Instance, error)
+GetAllConnectedInstances() ([]*Instance, error)
+GetAllConnectedInstancesByClientName(clientName string) ([]*Instance, error)
+GetAll(clientName string) ([]*Instance, error)
+Delete(instanceId string) error
+GetAdvancedSettings(instanceId string) (*AdvancedSettings, error)
+UpdateAdvancedSettings(instanceId string, settings *AdvancedSettings) error
}
class BodyStruct {
+string Chat
+int64 Duration
}
class chatService {
+ChatMute(data *BodyStruct, instance *Instance) (string, error)
+ChatUnmute(data *BodyStruct, instance *Instance) (string, error)
}
class appstate {
+BuildMute(recipient JID, mute bool, duration time.Duration) AppState
}
class ChatHandler {
+ChatMute(ctx *gin.Context)
}
class MaxMuteNote {
<<note>>
Constant: maxMuteDurationSeconds = 365 * 24 * 3600 (1 year cap)
}
Registry --> instanceCollector : registers
Registry --> InstanceRepository : uses
instanceCollector --> InstanceRepository : queries
chatService --> BodyStruct : consumes
chatService --> appstate : calls BuildMute
ChatHandler --> chatService : calls ChatMute
chatService .. MaxMuteNote
Flow diagram for the metrics dashboard data loading:

flowchart TD
User["User opens /manager"] --> Browser["Browser loads dashboard index.html"]
Browser --> LoadDataFunc["loadData() JS function"]
LoadDataFunc --> FetchInstances["fetch /instance/all (with apikey)"]
LoadDataFunc --> FetchServerOk["fetch /server/ok (with apikey)"]
FetchInstances -->|success| UpdateInstanceMetrics["Update cards: total, connected, disconnected, AlwaysOnline"]
FetchInstances -->|success| RenderTable["Render instances table"]
FetchInstances -->|error| ShowInstanceError["Show error and hint about API key"]
FetchServerOk -->|status ok| UpdateServerOnline["Show server Online, green icon"]
FetchServerOk -->|error or !ok| UpdateServerError["Show server error status"]
UpdateInstanceMetrics --> Done["Dashboard visible"]
RenderTable --> Done
ShowInstanceError --> Done
UpdateServerOnline --> Done
UpdateServerError --> Done
Done --> Interval["setInterval(loadData, 30000)"]
User --> RefreshButton["Click Atualizar"]
RefreshButton --> LoadDataFunc
Interval --> LoadDataFunc
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="pkg/metrics/metrics.go" line_range="137-140" />
<code_context>
+ ch <- c.descDisconnected
+}
+
+func (c *instanceCollector) Collect(ch chan<- prometheus.Metric) {
+ instances, err := c.repo.GetAllInstances()
+ if err != nil {
+ // Emit nothing on error rather than stale data.
+ return
+ }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Silently returning on repository errors hides scrape issues and makes diagnostics harder.
If `GetAllInstances` fails, the scrape appears successful but all `evolution_instances_*` series vanish, making failures hard to notice or alert on. Please surface this error—e.g., via an explicit health/error metric (like a gauge `evolution_instance_metrics_up` set to 0 on error, 1 on success), `prometheus.NewInvalidMetric`, and/or logging the error—to improve observability and debugging.
Suggested implementation:
```golang
func (c *instanceCollector) Collect(ch chan<- prometheus.Metric) {
instances, err := c.repo.GetAllInstances()
if err != nil {
// Surface repository errors as an invalid metric so scrape issues are visible.
ch <- prometheus.NewInvalidMetric(c.descTotal, err)
return
}
```
If you prefer a dedicated health gauge instead of (or in addition to) `NewInvalidMetric`, you'll need to:
1. Add a new descriptor to `instanceCollector` (e.g., `descUp *prometheus.Desc`) and initialize it where the collector is constructed, with a name like `evolution_instance_metrics_up`.
2. Emit that metric in `Collect`, setting it to `0` on error and `1` on success, and include it in `Describe`.
</issue_to_address>
### Comment 2
<location path="pkg/metrics/metrics.go" line_range="52-58" />
<code_context>
+ return time.Since(startTime).Seconds()
+ })
+
+ reg.MustRegister(
+ httpRequests,
+ httpDuration,
+ buildInfo,
+ uptimeGauge,
+ newInstanceCollector(instanceRepo),
+ )
+
+ return &Registry{
</code_context>
<issue_to_address>
**suggestion:** The custom Prometheus registry omits Go and process collectors, reducing observability of runtime/resource behavior.
Because this uses a standalone `prometheus.Registry`, it won’t include the default `go_*` or `process_*` metrics from the global registry. To retain standard CPU/memory/goroutine and process visibility, also register `prometheus.NewGoCollector()` and `prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{})` on `reg` alongside your custom metrics.
```suggestion
reg.MustRegister(
httpRequests,
httpDuration,
buildInfo,
uptimeGauge,
newInstanceCollector(instanceRepo),
prometheus.NewGoCollector(),
prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}),
)
```
</issue_to_address>
paluan-batista left a comment:

Good morning, friend, how are you? To help out with a review of your project (PR), I left a few specific comments as suggestions.
    r := gin.Default()

    metricsRegistry := metrics.New(version, instanceRepository)
    r.Use(metricsRegistry.GinMiddleware())

Suggestion: move this registration to after the CORS configuration. That prevents the metrics collector from processing and recording requests that would be blocked immediately by security policies.
    func (c *instanceCollector) Collect(ch chan<- prometheus.Metric) {
        instances, err := c.repo.GetAllInstances()

Suggestion: this query runs on every Prometheus scrape. With thousands of instances, that will cause load spikes on the database. Consider using a cache or incremental metrics instead of querying the database in real time.
    if data.Duration < 0 {
        return "", errors.New("duration must be >= 0 (0 = mute forever)")
    }
    if data.Duration > maxMuteDurationSeconds {

Suggestion: the maxMuteDurationSeconds check is correct. Just make sure the returned error is clear (e.g. "mute duration exceeds 1 year limit") so the user knows why the action failed.
What was added

New endpoint GET /metrics exposing metrics in the standard Prometheus text format. No authentication, following the Prometheus convention of protecting the endpoint at the network/ingress layer.

Exposed metrics:
- evolution_instances_total
- evolution_instances_connected
- evolution_instances_disconnected
- evolution_http_requests_total
- evolution_http_request_duration_seconds
- evolution_build_info (the version label carries the version)
- evolution_uptime_seconds

Technical details:
- Custom Collector that queries the database on each scrape: values are always current, with no need for event hooks
- HTTP path labels use the registered route patterns (e.g. /instance/:instanceId), keeping cardinality under control

New dependency: github.com/prometheus/client_golang v1.20.5

Example usage with Grafana: configure a Prometheus datasource pointing at http://<host>:<port>/metrics and import a standard Go dashboard, or build panels with the evolution_* metrics.

Summary by Sourcery
Expose a Prometheus-compatible /metrics endpoint and add a lightweight HTML dashboard for monitoring instance and server health, while extending chat mute functionality to support configurable durations and tightening related APIs.