Effective monitoring is essential for operating a production PKI platform. Lamassu IoT provides comprehensive observability through OpenTelemetry, enabling metrics, traces, and logs to be exported to your monitoring stack.
Observability Architecture
Lamassu implements the OpenTelemetry standard for telemetry data:
┌─────────────────────────────────────────────┐
│              Lamassu Services               │
│  ┌────────┐  ┌──────────┐  ┌─────────────┐  │
│  │ CA API │  │  Device  │  │ DMS Manager │  │
│  │        │  │  Manager │  │             │  │
│  └────────┘  └──────────┘  └─────────────┘  │
│       │           │               │         │
│       └───────────┴───────────────┘         │
│                   │ (OTLP)                  │
└───────────────────┼─────────────────────────┘
                    ▼
            ┌────────────────┐
            │ OTEL Collector │
            └────────────────┘
                    │
       ┌────────────┼──────────┐
       ▼            ▼          ▼
  ┌─────────┐  ┌─────────┐  ┌──────┐
  │ Grafana │  │  Tempo  │  │ Loki │
  │ (Mimir) │  │(Traces) │  │(Logs)│
  └─────────┘  └─────────┘  └──────┘
OpenTelemetry Configuration
Enabling Observability
Configure OpenTelemetry in your service configuration:
otel:
  metrics:
    enabled: true
    interval_in_millis: 10000
    hostname: "otel-collector"
    port: 4318
    scheme: "http"
  traces:
    enabled: true
    hostname: "otel-collector"
    port: 4318
    scheme: "http"
  logging:
    enabled: true
    hostname: "otel-collector"
    port: 4318
    scheme: "http"
OTLP Endpoints
Lamassu services export telemetry using OTLP (OpenTelemetry Protocol):
gRPC: Port 4317
HTTP: Port 4318
Production configuration with TLS:
otel:
  metrics:
    enabled: true
    interval_in_millis: 30000
    hostname: "otel-collector.internal.example.com"
    port: 4318
    scheme: "https"
  traces:
    enabled: true
    hostname: "otel-collector.internal.example.com"
    port: 4318
    scheme: "https"
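The scheme, hostname, and port settings above combine into one endpoint URL per signal. The following Go sketch shows that mapping using the default OTLP/HTTP paths (/v1/metrics, /v1/traces, /v1/logs). The function name otlpHTTPEndpoint is hypothetical, and whether Lamassu appends exactly these paths is an assumption, so treat it as an illustration of the configuration rather than Lamassu's own code.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// otlpHTTPEndpoint builds the OTLP/HTTP URL for one signal.
// signal must be "metrics", "traces", or "logs". The /v1/<signal> paths are
// the OTLP/HTTP defaults; that Lamassu uses them unchanged is an assumption.
func otlpHTTPEndpoint(scheme, hostname string, port int, signal string) (string, error) {
	switch signal {
	case "metrics", "traces", "logs":
	default:
		return "", fmt.Errorf("unknown signal %q", signal)
	}
	if scheme != "http" && scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", scheme)
	}
	u := url.URL{
		Scheme: scheme,
		Host:   hostname + ":" + strconv.Itoa(port),
		Path:   "/v1/" + signal,
	}
	return u.String(), nil
}

func main() {
	// Print the three endpoints implied by the production example above.
	for _, signal := range []string{"metrics", "traces", "logs"} {
		endpoint, err := otlpHTTPEndpoint("https", "otel-collector.internal.example.com", 4318, signal)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(endpoint)
	}
}
```

Printing the endpoints this way is a quick check that the collector address a service will use matches what the collector actually listens on.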
Quick Start with LGTM Stack
For development and testing, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) stack:
# otel/docker-compose.yaml
services:
  otel-lgtm:
    image: grafana/otel-lgtm:latest
    ports:
      - "3000:3000" # Grafana UI
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "3100:3100" # Loki
      - "9090:9090" # Mimir (Prometheus)
      - "3200:3200" # Tempo
    environment:
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_BASIC_ENABLED=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor metricsSummary traceToMetrics
Start the stack:
docker-compose -f otel/docker-compose.yaml up -d
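Access Grafana at http://localhost:3000 (the Grafana UI port mapped in the Compose file). Anonymous access with the Admin role is enabled in this setup, so it is suitable only for development and testing.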
Key Metrics
Certificate Authority Metrics
Monitor CA health and usage:
Certificates issued per CA:
rate(ca_certificates_issued_total{ca_id="production-ca"}[5m])
Certificate signing latency:
histogram_quantile(0.95,
rate(ca_sign_operation_duration_seconds_bucket[5m])
)
Active CAs by crypto engine:
count(ca_info) by (engine_type)
Device Manager Metrics
Device enrollment rate:
rate(device_enrollments_total[5m])
Device status distribution:
count(device_info) by (status)
Devices approaching expiration:
count(device_certificate_expiry_seconds < 2592000) # < 30 days
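The expiry thresholds in these queries are plain second counts: 2592000 is 30 days, and the same arithmetic gives 604800 for 7 days and 7776000 for 90 days. The Go sketch below derives them, assuming that device_certificate_expiry_seconds reports the remaining lifetime in seconds (the comparison in the query suggests this, but the source does not state it). The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// expiryThresholdSeconds converts a look-ahead window in days into the
// number of seconds used as the right-hand side of the PromQL comparison.
func expiryThresholdSeconds(days int) int64 {
	return int64((time.Duration(days) * 24 * time.Hour).Seconds())
}

// expiresWithin reports whether a certificate with remainingSeconds of
// lifetime left falls strictly inside the window, mirroring the "<" operator.
func expiresWithin(remainingSeconds int64, days int) bool {
	return remainingSeconds < expiryThresholdSeconds(days)
}

func main() {
	for _, days := range []int{7, 30, 90} {
		fmt.Printf("%d days = %d seconds\n", days, expiryThresholdSeconds(days))
	}
	// A certificate with 10 days left is inside a 30-day window.
	fmt.Println(expiresWithin(10*24*3600, 30))
}
```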
EST Enrollment Metrics
Enrollment success rate:
rate(est_enrollment_success_total[5m]) /
rate(est_enrollment_total[5m])
Enrollment failures by reason:
sum(rate(est_enrollment_failed_total[5m])) by (failure_reason)
EST operation latency (p95):
histogram_quantile(0.95,
  sum by (le, operation) (rate(est_operation_duration_seconds_bucket[5m]))
)
HTTP API Metrics
Lamassu automatically instruments HTTP endpoints with OpenTelemetry:
Request rate by endpoint:
sum by (http_route, http_method) (rate(http_server_requests_total[5m]))
Response time by endpoint:
histogram_quantile(0.95,
  sum by (le, http_route) (rate(http_server_duration_bucket[5m]))
)
Error rate by status code:
sum(rate(http_server_requests_total{http_status_code=~"5.."}[5m]))
by (http_route)
Database Metrics
GORM and PostgreSQL metrics:
Database connection pool utilization:
db_connections_in_use / db_connections_max
Query latency (p99):
histogram_quantile(0.99,
  sum by (le, operation) (rate(db_query_duration_seconds_bucket[5m]))
)
Connection errors:
rate(db_connection_errors_total[5m])
Crypto Engine Metrics
Crypto operation latency by engine:
histogram_quantile(0.95,
  sum by (le, engine_id, operation_type) (rate(crypto_operation_duration_seconds_bucket[5m]))
)
Crypto operation failures:
sum(rate(crypto_operation_failed_total[5m]))
by (engine_id, failure_reason)
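The queries in this section can also be run outside Grafana through the Prometheus-compatible HTTP API, whose instant-query endpoint is /api/v1/query. The Go sketch below builds such a URL and parses a vector response into a map keyed by one label. The base URL, the helper names, and the sample label values are hypothetical; only the endpoint path and the response shape come from the Prometheus API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// promResponse mirrors the instant-query response of the Prometheus HTTP API.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "sample value"]
		} `json:"result"`
	} `json:"data"`
}

// instantQueryURL builds an instant-query URL. baseURL is an assumption,
// for example http://localhost:9090 for the LGTM quick-start stack.
func instantQueryURL(baseURL, promQL string) string {
	return baseURL + "/api/v1/query?" + url.Values{"query": {promQL}}.Encode()
}

// parseVector maps the value of one label to the sample value of its series.
func parseVector(body []byte, label string) (map[string]float64, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" || resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected response: status=%q type=%q", resp.Status, resp.Data.ResultType)
	}
	out := make(map[string]float64, len(resp.Data.Result))
	for _, series := range resp.Data.Result {
		raw, ok := series.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("sample value is not a string")
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		out[series.Metric[label]] = v
	}
	return out, nil
}

func main() {
	fmt.Println(instantQueryURL("http://localhost:9090",
		`sum(rate(est_enrollment_failed_total[5m])) by (failure_reason)`))

	// Hand-written sample response; the failure_reason values are made up.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"failure_reason":"invalid_csr"},"value":[1741515330,"0.25"]},
		{"metric":{"failure_reason":"auth_failed"},"value":[1741515330,"1.5"]}]}}`)
	rates, err := parseVector(body, "failure_reason")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for reason, perSecond := range rates {
		fmt.Printf("%s: %.2f failures/s\n", reason, perSecond)
	}
}
```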
Distributed Tracing
Trace Context Propagation
Lamassu propagates trace context across service boundaries using W3C Trace Context:
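// Automatic trace propagation (W3C Trace Context plus Baggage).
// Imports: go.opentelemetry.io/otel and go.opentelemetry.io/otel/propagation
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{},
	propagation.Baggage{},
))
```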
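On the wire, W3C Trace Context travels in the traceparent HTTP header, formatted as version-traceid-spanid-flags. The Go sketch below parses a version 00 header with the standard library only, which can help when checking whether a proxy or client forwards the header intact. It is a simplified illustration, not the OpenTelemetry SDK's parser, and the type and function names are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields of a version 00 traceparent header.
type traceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters (the caller's span)
	Sampled bool   // lowest bit of the trace-flags byte
}

// validHex reports whether s is lowercase hex of the given length and not all zeros.
func validHex(s string, length int) bool {
	if len(s) != length || strings.ToLower(s) != s {
		return false
	}
	if _, err := hex.DecodeString(s); err != nil {
		return false
	}
	return strings.Trim(s, "0") != ""
}

// parseTraceParent parses headers such as
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01.
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("only version 00 headers with 4 fields are handled")
	}
	if !validHex(parts[1], 32) {
		return traceParent{}, fmt.Errorf("invalid trace-id %q", parts[1])
	}
	if !validHex(parts[2], 16) {
		return traceParent{}, fmt.Errorf("invalid parent-id %q", parts[2])
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil || len(flags) != 1 {
		return traceParent{}, fmt.Errorf("invalid trace-flags %q", parts[3])
	}
	return traceParent{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 1,
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("trace=%s span=%s sampled=%t\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```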
Key Trace Operations
Certificate issuance flow:
POST /api/ca/v1/cas/{ca-id}/certificates/sign
│
├─ Validate CSR
│ └─ Parse X.509 CSR
│
├─ Check CA status
│ └─ Database query: SELECT * FROM cas WHERE id=?
│
├─ Crypto engine sign
│ ├─ Load CA private key
│ └─ Sign certificate (PKCS#11/Vault/AWS KMS)
│
├─ Store certificate
│ └─ Database insert: INSERT INTO certificates
│
└─ Publish event
└─ EventBus: certificate.issued
EST enrollment flow:
POST /.well-known/est/{dms-id}/simpleenroll
│
├─ Extract client certificate (mTLS)
│ └─ Validate against DMS validation CAs
│
├─ Parse CSR from request body
│ └─ Decode base64 DER CSR
│
├─ Get DMS configuration
│ └─ Database query: SELECT * FROM dms WHERE id=?
│
├─ Call CA service to sign
│ └─ HTTP: POST /api/ca/v1/cas/{ca-id}/certificates/sign
│ └─ (See certificate issuance flow above)
│
└─ Return PKCS#7 response
└─ Encode certificate in base64 DER
Viewing Traces in Grafana
Navigate to Explore in Grafana
Select Tempo data source
Use TraceQL to query traces:
{ span.http.route = "/api/ca/v1/cas/{id}/certificates/sign" }
Filter by latency:
{ span.http.route = "/api/ca/v1/cas/{id}/certificates/sign"
&& duration > 1s }
Find errors:
{ status = error }
Logging
Structured Logging
Lamassu uses structured logging with configurable levels:
logs:
  level: info # debug, info, warn, error
Log levels:
Level   Use Case
debug   Development, troubleshooting
info    Normal operations, key events
warn    Potential issues, degraded performance
error   Errors requiring attention
Log Fields
Standard fields in Lamassu logs:
{
  "timestamp": "2025-03-09T10:15:30.123Z",
  "level": "info",
  "service": "ca-api",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "message": "Certificate issued",
  "ca_id": "production-ca",
  "certificate_id": "cert-12345",
  "subject_cn": "device-001",
  "duration_ms": 245
}
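Because every entry is a JSON object with stable field names, log lines can be decoded into a typed structure for ad hoc analysis. The Go sketch below decodes the common fields from the example above and ignores event-specific ones such as ca_id. The logEntry type and the helper names are hypothetical and are not part of Lamassu.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logEntry models the standard fields shown in the example log record.
// Event-specific fields (ca_id, certificate_id, ...) are ignored here.
type logEntry struct {
	Timestamp  time.Time `json:"timestamp"`
	Level      string    `json:"level"`
	Service    string    `json:"service"`
	TraceID    string    `json:"trace_id"`
	SpanID     string    `json:"span_id"`
	Message    string    `json:"message"`
	DurationMS int64     `json:"duration_ms"`
}

// parseLogLine decodes one JSON log line.
func parseLogLine(line string) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

// isSlow reports whether the entry took longer than thresholdMS milliseconds.
func isSlow(e logEntry, thresholdMS int64) bool {
	return e.DurationMS > thresholdMS
}

func main() {
	line := `{"timestamp":"2025-03-09T10:15:30.123Z","level":"info","service":"ca-api",` +
		`"trace_id":"abc123def456","span_id":"789xyz","message":"Certificate issued",` +
		`"ca_id":"production-ca","duration_ms":245}`
	e, err := parseLogLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s [%s] %s (trace %s, %d ms)\n",
		e.Timestamp.Format(time.RFC3339), e.Level, e.Message, e.TraceID, e.DurationMS)
	fmt.Println("slower than 1s:", isSlow(e, 1000))
}
```

The trace_id field is what links a log line to its trace in Tempo, so keeping it in any derived structure preserves that correlation.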
Querying Logs with LogQL
Use Loki’s LogQL to query logs in Grafana:
All error logs:
{service="ca-api"} | json | level="error"
EST enrollment failures:
{service="dms-manager"}
|= "enrollment"
|= "failed"
| json
| line_format "{{.timestamp}} [{{.dms_id}}] {{.failure_reason}}"
Certificate issuance rate:
sum(rate({service="ca-api"} |= "Certificate issued" [5m]))
Slow database queries:
{service=~".*"}
|= "database query"
| json
| duration_ms > 1000
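The same LogQL queries can be sent to Loki's HTTP API, whose range-query endpoint is /loki/api/v1/query_range with query, start, end, and limit parameters. The Go sketch below only builds the request URL. The base URL http://localhost:3100 matches the Loki port in the quick-start stack but is otherwise an assumption, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// lokiQueryRangeURL builds a range-query URL for Loki.
// startUnixNano and endUnixNano are Unix timestamps in nanoseconds.
func lokiQueryRangeURL(baseURL, logQL string, startUnixNano, endUnixNano int64, limit int) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	if endUnixNano < startUnixNano {
		return "", fmt.Errorf("end is before start")
	}
	u.Path = strings.TrimSuffix(u.Path, "/") + "/loki/api/v1/query_range"
	q := url.Values{}
	q.Set("query", logQL)
	q.Set("start", strconv.FormatInt(startUnixNano, 10))
	q.Set("end", strconv.FormatInt(endUnixNano, 10))
	q.Set("limit", strconv.Itoa(limit))
	u.RawQuery = q.Encode() // Encode escapes the LogQL and sorts keys
	return u.String(), nil
}

func main() {
	// Query the last hour of slow database query logs.
	now := time.Now()
	logQL := `{service=~".*"} |= "database query" | json | duration_ms > 1000`
	target, err := lokiQueryRangeURL("http://localhost:3100", logQL,
		now.Add(-time.Hour).UnixNano(), now.UnixNano(), 100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(target)
}
```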
Dashboards
Recommended Grafana Dashboards
PKI Overview
Total CAs and certificates
Certificate issuance rate
Expiration distribution
CA health status
Device Fleet
Total devices by status
Enrollment success rate
Devices near expiration
Geographic distribution
Service Health
HTTP request rate and latency
Error rates by service
Database connection pool
Crypto engine performance
Security Events
Failed authentications
Certificate revocations
CA operations (create/delete)
DMS configuration changes
Sample Dashboard Panels
Certificate expiration timeline:
count(certificate_expiry_seconds < 7776000) by (ca_id) # < 90 days
Service availability (uptime):
avg_over_time(
up{job="lamassu-ca"}[1h]
) * 100
EST enrollment funnel:
sum(rate(est_enrollment_total[5m])) by (stage)
# Stages: received, authenticated, validated, signed, issued
Alerting
Alert Rules
Define alert rules in Prometheus/Mimir:
High error rate:
groups:
  - name: lamassu_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_server_requests_total{http_status_code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_server_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
Certificate expiration warning:
      - alert: CertificatesExpiringSoon
        expr: |
          count(certificate_expiry_seconds < 604800) # < 7 days
            > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} certificates expiring within 7 days"
Crypto engine failure:
      - alert: CryptoEngineFailure
        expr: |
          rate(crypto_operation_failed_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Crypto engine {{ $labels.engine_id }} experiencing failures"
Database connection exhaustion:
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          db_connections_in_use / db_connections_max > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool {{ $labels.service }} is over 90% utilized"
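Each rule above pairs an expression with a for duration: the alert stays pending while the expression is true and fires only once it has been true continuously for that long. The Go sketch below is a simplified model of that behaviour, replaying one boolean result per evaluation interval. It illustrates the semantics under that assumption and is not how Prometheus or Mimir is implemented; all names are hypothetical.

```go
package main

import "fmt"

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// evaluate returns the alert state after each evaluation.
// results[i] is whether the rule expression matched at evaluation i,
// intervalSeconds is the group evaluation interval, and forSeconds is
// the rule's "for" duration.
func evaluate(results []bool, intervalSeconds, forSeconds int) []alertState {
	states := make([]alertState, len(results))
	activeSince := -1 // index of the evaluation where the condition became true
	for i, matched := range results {
		if !matched {
			// Any gap resets the timer, so the alert must start over.
			activeSince = -1
			states[i] = inactive
			continue
		}
		if activeSince < 0 {
			activeSince = i
		}
		if (i-activeSince)*intervalSeconds >= forSeconds {
			states[i] = firing
		} else {
			states[i] = pending
		}
	}
	return states
}

func main() {
	// 30s interval and a 2m "for", as in the CryptoEngineFailure rule.
	results := []bool{true, true, true, true, true, false, true}
	for i, s := range evaluate(results, 30, 120) {
		fmt.Printf("t=%3ds matched=%-5t state=%s\n", i*30, results[i], s)
	}
}
```

In this model a single non-matching evaluation returns the alert to inactive, which is why short for values suit hard failures and longer ones suit noisy signals.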
Alert Channels
Configure notification channels in Grafana:
Email: Critical alerts to ops team
Slack/Teams: Real-time notifications
PagerDuty: On-call escalation for critical issues
Webhook: Integration with incident management systems
Health Checks
Service Health Endpoints
Each Lamassu service exposes health check endpoints:
Liveness probe:
curl http://localhost:8080/health/live
# Returns: {"status": "UP"}
Readiness probe:
curl http://localhost:8080/health/ready
# Returns: {"status": "UP", "checks": {"database": "UP", "eventbus": "UP"}}
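A monitoring script can decode the readiness response to report which dependency is failing. The Go sketch below parses the JSON shape shown above. It assumes that a failing check reports a value other than "UP"; the source shows only healthy responses, so the "DOWN" value in the example is hypothetical, as are the type and function names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// readiness models the body returned by /health/ready.
type readiness struct {
	Status string            `json:"status"`
	Checks map[string]string `json:"checks"`
}

// checkReadiness reports whether the service is ready and lists, in
// alphabetical order, every check whose status is not "UP".
func checkReadiness(body []byte) (bool, []string, error) {
	var r readiness
	if err := json.Unmarshal(body, &r); err != nil {
		return false, nil, err
	}
	var failed []string
	for name, status := range r.Checks {
		if status != "UP" {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return r.Status == "UP" && len(failed) == 0, failed, nil
}

func main() {
	// The second body uses a hypothetical "DOWN" status for illustration.
	bodies := []string{
		`{"status": "UP", "checks": {"database": "UP", "eventbus": "UP"}}`,
		`{"status": "DOWN", "checks": {"database": "DOWN", "eventbus": "UP"}}`,
	}
	for _, body := range bodies {
		ready, failed, err := checkReadiness([]byte(body))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ready=%t failed=%v\n", ready, failed)
	}
}
```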
Kubernetes Probes
apiVersion: v1
kind: Pod
metadata:
  name: lamassu-ca
spec:
  containers:
    - name: ca-api
      image: lamassuiot/ca:latest
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
Resource Utilization
Monitor CPU, memory, and disk usage:
CPU usage:
rate(process_cpu_seconds_total{service="ca-api"}[5m]) * 100
Memory usage:
process_resident_memory_bytes{service="ca-api"} / 1024 / 1024 # MB
Disk I/O:
rate(node_disk_io_time_seconds_total[5m])
Capacity Planning
Track growth trends:
Certificate growth rate:
deriv(count(certificate_info)[1d:5m])
Database size growth:
pg_database_size_bytes{database="lamassu"}
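Growth figures like these become actionable once they are projected forward. The Go sketch below applies a plain linear extrapolation: it estimates daily growth from two size samples and then the number of days until a capacity limit is reached. Real growth is rarely perfectly linear, so this is a rough planning aid under that stated assumption, and the function names and sample sizes are hypothetical.

```go
package main

import "fmt"

// growthPerDay estimates linear growth in bytes per day from two samples
// taken days apart. It returns 0 when days is not positive.
func growthPerDay(firstBytes, lastBytes, days float64) float64 {
	if days <= 0 {
		return 0
	}
	return (lastBytes - firstBytes) / days
}

// daysUntilFull projects how many days remain until currentBytes reaches
// capacityBytes at a constant bytesPerDay. The boolean is false when the
// size is not growing, in which case no projection is possible.
func daysUntilFull(currentBytes, capacityBytes, bytesPerDay float64) (float64, bool) {
	if bytesPerDay <= 0 {
		return 0, false
	}
	if currentBytes >= capacityBytes {
		return 0, true
	}
	return (capacityBytes - currentBytes) / bytesPerDay, true
}

func main() {
	// Hypothetical samples: 40 GB ten days ago, 50 GB today, 100 GB volume.
	rate := growthPerDay(40e9, 50e9, 10)
	days, ok := daysUntilFull(50e9, 100e9, rate)
	if !ok {
		fmt.Println("database is not growing")
		return
	}
	fmt.Printf("growth: %.0f bytes/day, full in about %.0f days\n", rate, days)
}
```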
Troubleshooting with Observability
Scenario: Slow Certificate Issuance
Check HTTP latency metrics:
histogram_quantile(0.95,
rate(http_server_duration_bucket{http_route="/api/ca/v1/cas/{id}/certificates/sign"}[5m])
)
View distributed trace to identify bottleneck:
Navigate to Tempo in Grafana
Search for slow traces: { duration > 2s }
Identify which span is taking the most time (DB query, crypto operation, etc.)
Check crypto engine metrics:
histogram_quantile(0.95,
  sum by (le, engine_id) (rate(crypto_operation_duration_seconds_bucket[5m]))
)
Review logs for errors:
{service="ca-api"} |= "sign" |= "error"
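Identifying the bottleneck in a trace amounts to comparing the durations of the child spans under the request. The Go sketch below does this for a set of spans named after the steps of the certificate issuance flow. The durations are made up for illustration, and the span type is a hypothetical simplification rather than Tempo's data model.

```go
package main

import "fmt"

// span is a minimal stand-in for one child span of a trace.
type span struct {
	Name       string
	DurationMS int64
}

// slowestSpan returns the span with the largest duration.
// The boolean is false when no spans were given.
func slowestSpan(spans []span) (span, bool) {
	if len(spans) == 0 {
		return span{}, false
	}
	slowest := spans[0]
	for _, s := range spans[1:] {
		if s.DurationMS > slowest.DurationMS {
			slowest = s
		}
	}
	return slowest, true
}

// shareOfTotal returns the fraction of the summed span time spent in s.
func shareOfTotal(s span, spans []span) float64 {
	var total int64
	for _, other := range spans {
		total += other.DurationMS
	}
	if total == 0 {
		return 0
	}
	return float64(s.DurationMS) / float64(total)
}

func main() {
	// Hypothetical durations for the steps of the issuance flow.
	spans := []span{
		{Name: "Validate CSR", DurationMS: 5},
		{Name: "Check CA status", DurationMS: 20},
		{Name: "Crypto engine sign", DurationMS: 1800},
		{Name: "Store certificate", DurationMS: 150},
		{Name: "Publish event", DurationMS: 25},
	}
	if s, ok := slowestSpan(spans); ok {
		fmt.Printf("bottleneck: %s (%d ms, %.0f%% of span time)\n",
			s.Name, s.DurationMS, shareOfTotal(s, spans)*100)
	}
}
```

A bottleneck in the signing span points at the crypto engine, which is why the next step in this scenario checks crypto operation latency.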
Scenario: Enrollment Failures
Check enrollment failure rate:
sum by (failure_reason) (rate(est_enrollment_failed_total[5m]))
Review authentication failures:
{service="dms-manager"} |= "authentication" |= "failed"
Trace a failed enrollment:
{ span.http.route = "/.well-known/est/{dms-id}/simpleenroll"
&& status = error }