Monitoring - Lamassu IoT

Effective monitoring is essential for operating a production PKI platform. Lamassu IoT provides comprehensive observability through OpenTelemetry, enabling metrics, traces, and logs to be exported to your monitoring stack.

Observability Architecture

Lamassu implements the OpenTelemetry standard for telemetry data:

┌─────────────────────────────────────────────┐
│  Lamassu Services                           │
│  ┌────────┐  ┌──────────┐  ┌─────────────┐ │
│  │ CA API │  │  Device  │  │ DMS Manager │ │
│  │        │  │  Manager │  │             │ │
│  └────────┘  └──────────┘  └─────────────┘ │
│       │            │              │         │
│       └────────────┴──────────────┘         │
│              │ (OTLP)                       │
└──────────────┼──────────────────────────────┘
               ▼
      ┌────────────────┐
      │ OTEL Collector │
      └────────────────┘
               │
       ┌───────┴────────┬──────────┐
       ▼                ▼          ▼
  ┌─────────┐    ┌─────────┐  ┌──────┐
  │ Grafana │    │  Tempo  │  │ Loki │
  │ (Mimir) │    │(Traces) │  │(Logs)│
  └─────────┘    └─────────┘  └──────┘

OpenTelemetry Configuration

Enabling Observability

Configure OpenTelemetry in your service configuration:

otel:
  metrics:
    enabled: true
    interval_in_millis: 10000
    hostname: "otel-collector"
    port: 4318
    scheme: "http"
  traces:
    enabled: true
    hostname: "otel-collector"
    port: 4318
    scheme: "http"
  logging:
    enabled: true
    hostname: "otel-collector"
    port: 4318
    scheme: "http"
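
Each signal block above resolves to a collector URL. The following is a minimal Go sketch of that mapping, not Lamassu's actual code: the `Exporter` struct and `EndpointURL` function are hypothetical names, and the `/v1/metrics`, `/v1/traces`, and `/v1/logs` paths are the OTLP/HTTP defaults rather than anything stated in the configuration (note that the `logging` block would map to the `logs` signal path).

```go
package main

import (
	"fmt"
	"net/url"
)

// Exporter mirrors one signal block (metrics, traces, or logging) of the
// otel configuration. The struct is hypothetical, for illustration only.
type Exporter struct {
	Enabled  bool
	Hostname string
	Port     int
	Scheme   string
}

// EndpointURL builds the OTLP/HTTP URL for a signal. The /v1/<signal> paths
// are the OTLP/HTTP defaults; signal is "metrics", "traces", or "logs".
func EndpointURL(e Exporter, signal string) (string, error) {
	if !e.Enabled {
		return "", fmt.Errorf("%s export is disabled", signal)
	}
	if e.Scheme != "http" && e.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", e.Scheme)
	}
	u := url.URL{
		Scheme: e.Scheme,
		Host:   fmt.Sprintf("%s:%d", e.Hostname, e.Port),
		Path:   "/v1/" + signal,
	}
	return u.String(), nil
}

func main() {
	metrics := Exporter{Enabled: true, Hostname: "otel-collector", Port: 4318, Scheme: "http"}
	endpoint, err := EndpointURL(metrics, "metrics")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(endpoint)
}
```

Printing the resolved URL at startup is a quick way to confirm that a service and the collector agree on host, port, and scheme.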

OTLP Endpoints

Lamassu services export telemetry using OTLP (OpenTelemetry Protocol):

gRPC: Port 4317
HTTP: Port 4318

Production configuration with TLS:

otel:
  metrics:
    enabled: true
    interval_in_millis: 30000
    hostname: "otel-collector.internal.example.com"
    port: 4318
    scheme: "https"
  traces:
    enabled: true
    hostname: "otel-collector.internal.example.com"
    port: 4318
    scheme: "https"

Quick Start with LGTM Stack

For development and testing, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) stack:

# otel/docker-compose.yaml
services:
  otel-lgtm:
    image: grafana/otel-lgtm:latest
    ports:
      - "3000:3000"   # Grafana UI
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "3100:3100"   # Loki
      - "9090:9090"   # Mimir (Prometheus)
      - "3200:3200"   # Tempo
    environment:
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_BASIC_ENABLED=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor metricsSummary traceToMetrics

Start the stack:

docker-compose -f otel/docker-compose.yaml up -d

Access Grafana:

http://localhost:3000

Key Metrics

Certificate Authority Metrics

Monitor CA health and usage.

Certificates issued per CA:

rate(ca_certificates_issued_total{ca_id="production-ca"}[5m])
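
To make the per-second rate concrete, here is a simplified Go sketch of what a rate over counter samples computes. It is an approximation under stated assumptions: `Sample` and `SimpleRate` are hypothetical names, and unlike Prometheus the sketch does not extrapolate to the edges of the range window. It does handle counter resets, which occur when a service restarts.

```go
package main

import "fmt"

// Sample is one scrape of a monotonically increasing counter.
type Sample struct {
	UnixSeconds float64
	Value       float64
}

// SimpleRate approximates rate(): total increase divided by elapsed seconds.
// A drop in value is treated as a counter reset (the counter restarted at zero).
// Unlike Prometheus, it does not extrapolate to the window boundaries.
func SimpleRate(samples []Sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].Value - samples[i-1].Value
		if delta < 0 {
			delta = samples[i].Value // counter reset: count from zero again
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].UnixSeconds - samples[0].UnixSeconds
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// 60 certificates issued over 300 seconds.
	window := []Sample{{0, 100}, {150, 130}, {300, 160}}
	fmt.Printf("%.3f certificates/second\n", SimpleRate(window))
}
```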

Certificate signing latency:

histogram_quantile(0.95, 
  rate(ca_sign_operation_duration_seconds_bucket[5m])
)

Active CAs by crypto engine:

count(ca_info) by (engine_type)

Device Manager Metrics

Device enrollment rate:

rate(device_enrollments_total[5m])

Device status distribution:

count(device_info) by (status)

Devices approaching expiration:

count(device_certificate_expiry_seconds < 2592000)  # < 30 days

EST Enrollment Metrics

Enrollment success rate:

sum(rate(est_enrollment_success_total[5m])) /
  sum(rate(est_enrollment_total[5m]))

Enrollment failures by reason:

sum(rate(est_enrollment_failed_total[5m])) by (failure_reason)

EST operation latency (p95):

histogram_quantile(0.95,
  sum(rate(est_operation_duration_seconds_bucket[5m])) by (le, operation)
)

HTTP API Metrics

Lamassu automatically instruments HTTP endpoints with OpenTelemetry.

Request rate by endpoint:

sum(rate(http_server_requests_total[5m])) by (http_route, http_method)

Response time by endpoint:

histogram_quantile(0.95,
  sum(rate(http_server_duration_bucket[5m])) by (le, http_route)
)

Server error (5xx) rate by endpoint:

sum(rate(http_server_requests_total{http_status_code=~"5.."}[5m])) 
  by (http_route)

Database Metrics

Metrics from GORM and PostgreSQL.

Database connection pool utilization:

db_connections_in_use / db_connections_max

Query latency (p99):

histogram_quantile(0.99,
  sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)
)

Connection errors:

rate(db_connection_errors_total[5m])

Crypto Engine Metrics

Crypto operation latency by engine:

histogram_quantile(0.95,
  sum(rate(crypto_operation_duration_seconds_bucket[5m])) by (le, engine_id, operation_type)
)

Crypto operation failures:

sum(rate(crypto_operation_failed_total[5m])) 
  by (engine_id, failure_reason)

Distributed Tracing

Trace Context Propagation

Lamassu propagates trace context across service boundaries using W3C Trace Context:

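import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

// Register the W3C Trace Context and Baggage propagators globally.
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{},
    propagation.Baggage{},
))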
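
On the wire, W3C Trace Context travels in the `traceparent` HTTP header as four hyphen-separated hex fields: version, trace ID, parent span ID, and trace flags. The Go sketch below parses that header using only the standard library. `TraceParent` and `ParseTraceParent` are hypothetical names for illustration; in a real service the OpenTelemetry propagator does this work.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
type TraceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters (the parent span)
	Sampled bool   // lowest bit of the trace-flags byte
}

// isHex reports whether s is exactly n lowercase hex characters (n must be even).
func isHex(s string, n int) bool {
	if len(s) != n || s != strings.ToLower(s) {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// ParseTraceParent validates and splits a traceparent header value.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("unsupported traceparent format")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isHex(traceID, 32) || !isHex(spanID, 16) || !isHex(flags, 2) {
		return TraceParent{}, errors.New("malformed traceparent field")
	}
	if traceID == strings.Repeat("0", 32) || spanID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("all-zero trace or span ID is invalid")
	}
	flagByte, _ := hex.DecodeString(flags) // already validated by isHex
	return TraceParent{TraceID: traceID, SpanID: spanID, Sampled: flagByte[0]&0x01 == 1}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("trace=%s span=%s sampled=%t\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

When a trace appears broken between two services, logging the raw `traceparent` value on both sides and comparing the trace IDs is often the quickest check.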

Key Trace Operations

Certificate issuance flow:

POST /api/ca/v1/cas/{ca-id}/certificates/sign
  │
  ├─ Validate CSR
  │   └─ Parse X.509 CSR
  │
  ├─ Check CA status
  │   └─ Database query: SELECT * FROM cas WHERE id=?
  │
  ├─ Crypto engine sign
  │   ├─ Load CA private key
  │   └─ Sign certificate (PKCS#11/Vault/AWS KMS)
  │
  ├─ Store certificate
  │   └─ Database insert: INSERT INTO certificates
  │
  └─ Publish event
      └─ EventBus: certificate.issued

EST enrollment flow:

POST /.well-known/est/{dms-id}/simpleenroll
  │
  ├─ Extract client certificate (mTLS)
  │   └─ Validate against DMS validation CAs
  │
  ├─ Parse CSR from request body
  │   └─ Decode base64 DER CSR
  │
  ├─ Get DMS configuration
  │   └─ Database query: SELECT * FROM dms WHERE id=?
  │
  ├─ Call CA service to sign
  │   └─ HTTP: POST /api/ca/v1/cas/{ca-id}/certificates/sign
  │       └─ (See certificate issuance flow above)
  │
  └─ Return PKCS#7 response
      └─ Encode certificate in base64 DER
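
The "Parse CSR from request body" step in the flow above can be sketched in Go with the standard library alone. This is an illustrative sketch, not the DMS Manager implementation: `DecodeEnrollmentCSR` and `newTestCSR` are hypothetical helpers, and the signature check is an assumption about what a server would reasonably verify at this point.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/base64"
	"fmt"
	"strings"
)

// DecodeEnrollmentCSR decodes a base64-encoded DER PKCS#10 request body and
// verifies its self-signature (proof of possession of the private key).
func DecodeEnrollmentCSR(body string) (*x509.CertificateRequest, error) {
	// Bodies are often wrapped across lines, so drop whitespace before decoding.
	compact := strings.Join(strings.Fields(body), "")
	der, err := base64.StdEncoding.DecodeString(compact)
	if err != nil {
		return nil, fmt.Errorf("decode base64: %w", err)
	}
	csr, err := x509.ParseCertificateRequest(der)
	if err != nil {
		return nil, fmt.Errorf("parse CSR: %w", err)
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, fmt.Errorf("verify CSR signature: %w", err)
	}
	return csr, nil
}

// newTestCSR builds a throwaway CSR so the example runs without external files.
func newTestCSR(commonName string) (string, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return "", err
	}
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: commonName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(der), nil
}

func main() {
	body, err := newTestCSR("device-001")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	csr, err := DecodeEnrollmentCSR(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("CSR subject CN:", csr.Subject.CommonName)
}
```

Each wrapped error names the step that failed, which corresponds to the span where a failed enrollment trace would be expected to stop.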

Viewing Traces in Grafana

1. Navigate to Explore in Grafana.
2. Select the Tempo data source.
3. Use TraceQL to query traces:

{ span.http.route = "/api/ca/v1/cas/{id}/certificates/sign" }

Filter by latency:

{ span.http.route = "/api/ca/v1/cas/{id}/certificates/sign" 
  && duration > 1s }

Find errors:

{ status = error }

Logging

Structured Logging

Lamassu uses structured logging with configurable levels:

logs:
  level: info  # debug, info, warn, error

Log levels:

Level	Use Case
`debug`	Development, troubleshooting
`info`	Normal operations, key events
`warn`	Potential issues, degraded performance
`error`	Errors requiring attention

Log Fields

Standard fields in Lamassu logs:

{
  "timestamp": "2025-03-09T10:15:30.123Z",
  "level": "info",
  "service": "ca-api",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "message": "Certificate issued",
  "ca_id": "production-ca",
  "certificate_id": "cert-12345",
  "subject_cn": "device-001",
  "duration_ms": 245
}
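
Because every record is a single JSON object, log lines can be processed with ordinary JSON tooling. The Go sketch below decodes the sample record into a struct. `LogEntry`, `ParseLogLine`, and `IsSlow` are hypothetical names, and the struct lists only the fields shown above; any additional fields a service emits would be ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LogEntry covers the standard fields from the sample record.
type LogEntry struct {
	Timestamp  string `json:"timestamp"`
	Level      string `json:"level"`
	Service    string `json:"service"`
	TraceID    string `json:"trace_id"`
	SpanID     string `json:"span_id"`
	Message    string `json:"message"`
	DurationMS int64  `json:"duration_ms"`
}

// ParseLogLine decodes one JSON log line; unknown fields such as ca_id are ignored.
func ParseLogLine(line string) (LogEntry, error) {
	var e LogEntry
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

// IsSlow reports whether the logged operation exceeded thresholdMS milliseconds.
func IsSlow(e LogEntry, thresholdMS int64) bool {
	return e.DurationMS > thresholdMS
}

func main() {
	line := `{"timestamp":"2025-03-09T10:15:30.123Z","level":"info","service":"ca-api",` +
		`"trace_id":"abc123def456","span_id":"789xyz","message":"Certificate issued",` +
		`"ca_id":"production-ca","duration_ms":245}`
	entry, err := ParseLogLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s [%s] %s trace=%s slow=%t\n",
		entry.Timestamp, entry.Level, entry.Message, entry.TraceID, IsSlow(entry, 1000))
}
```

The `trace_id` field is what links a log line to its trace, so it is worth keeping in any downstream log schema.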

Querying Logs with LogQL

Use Loki’s LogQL to query logs in Grafana.

All error logs:

{service="ca-api"} | json | level="error"

EST enrollment failures:

{service="dms-manager"} 
  |= "enrollment" 
  |= "failed" 
  | json 
  | line_format "{{.timestamp}} [{{.dms_id}}] {{.failure_reason}}"

Certificate issuance rate:

sum(rate({service="ca-api"} |= "Certificate issued" [5m]))

Slow database queries:

{service=~".+"} 
  |= "database query" 
  | json 
  | duration_ms > 1000

Dashboards

Recommended Grafana Dashboards

PKI Overview

Total CAs and certificates
Certificate issuance rate
Expiration distribution
CA health status

Device Fleet

Total devices by status
Enrollment success rate
Devices near expiration
Geographic distribution

Service Health

HTTP request rate and latency
Error rates by service
Database connection pool
Crypto engine performance

Security Events

Failed authentications
Certificate revocations
CA operations (create/delete)
DMS configuration changes

Sample Dashboard Panels

Certificate expiration timeline:

count(certificate_expiry_seconds < 7776000) by (ca_id)  # < 90 days

Service availability (uptime):

avg_over_time(
  up{job="lamassu-ca"}[1h]
) * 100

EST enrollment funnel:

sum(rate(est_enrollment_total[5m])) by (stage)
# Stages: received, authenticated, validated, signed, issued

Alerting

Alert Rules

Define alert rules in Prometheus/Mimir.

High error rate:

groups:
  - name: lamassu_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_total{http_status_code=~"5.."}[5m])) by (service)
            /
          sum(rate(http_server_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

Certificate expiration warning:

- alert: CertificatesExpiringSoon
  expr: |
    count(certificate_expiry_seconds < 604800)  # < 7 days
      > 10
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "{{ $value }} certificates expiring within 7 days"

Crypto engine failure:

- alert: CryptoEngineFailure
  expr: |
    rate(crypto_operation_failed_total[5m]) > 0.01
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Crypto engine {{ $labels.engine_id }} experiencing failures"

Database connection exhaustion:

- alert: DatabaseConnectionPoolExhausted
  expr: |
    db_connections_in_use / db_connections_max > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool for {{ $labels.service }} is over 90% utilized"

Alert Channels

Configure notification channels in Grafana:

Email: Critical alerts to ops team
Slack/Teams: Real-time notifications
PagerDuty: On-call escalation for critical issues
Webhook: Integration with incident management systems

Health Checks

Service Health Endpoints

Each Lamassu service exposes health check endpoints.

Liveness probe:

curl http://localhost:8080/health/live
# Returns: {"status": "UP"}

Readiness probe:

curl http://localhost:8080/health/ready
# Returns: {"status": "UP", "checks": {"database": "UP", "eventbus": "UP"}}
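
A readiness response can also be inspected programmatically, for example in a deployment script. The Go sketch below parses the JSON shape shown above and lists the dependencies that are not up. `Readiness` and `FailingChecks` are hypothetical names, and the `"DOWN"` value is an assumption, since the documentation only shows the `"UP"` case.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Readiness mirrors the readiness response shape.
type Readiness struct {
	Status string            `json:"status"`
	Checks map[string]string `json:"checks"`
}

// FailingChecks returns the sorted names of dependencies whose status is not "UP".
func FailingChecks(body []byte) ([]string, error) {
	var r Readiness
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	var failing []string
	for name, status := range r.Checks {
		if status != "UP" {
			failing = append(failing, name)
		}
	}
	sort.Strings(failing)
	return failing, nil
}

func main() {
	// "DOWN" is an assumed failure value; only "UP" appears in the documentation.
	body := []byte(`{"status": "DOWN", "checks": {"database": "DOWN", "eventbus": "UP"}}`)
	failing, err := FailingChecks(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("failing checks:", failing)
}
```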

Kubernetes Probes

apiVersion: v1
kind: Pod
metadata:
  name: lamassu-ca
spec:
  containers:
  - name: ca-api
    image: lamassuiot/ca:latest
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5

Performance Monitoring

Resource Utilization

Monitor CPU, memory, and disk usage.

CPU usage:

rate(process_cpu_seconds_total{service="ca-api"}[5m]) * 100

Memory usage:

process_resident_memory_bytes{service="ca-api"} / 1024 / 1024  # MB

Disk I/O:

rate(node_disk_io_time_seconds_total[5m])

Capacity Planning

Track growth trends.

Certificate growth rate:

deriv(count(certificate_info)[1d:1h])
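
A growth rate of this kind is the slope of a line fitted through recent samples, which is how PromQL's `deriv()` function estimates a per-second derivative (simple linear regression). The Go sketch below shows that calculation. `Point` and `Slope` are hypothetical names, and the sample values in `main` are made up for illustration.

```go
package main

import "fmt"

// Point is one sample: T in seconds, V the certificate count at that time.
type Point struct {
	T, V float64
}

// Slope returns the least-squares slope in value units per second.
// The second result is false when a slope cannot be computed.
func Slope(points []Point) (float64, bool) {
	if len(points) < 2 {
		return 0, false
	}
	n := float64(len(points))
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range points {
		sumT += p.T
		sumV += p.V
		sumTV += p.T * p.V
		sumTT += p.T * p.T
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false // all samples share the same timestamp
	}
	return (n*sumTV - sumT*sumV) / denom, true
}

func main() {
	// Hourly samples: 60 new certificates per hour.
	samples := []Point{{0, 1000}, {3600, 1060}, {7200, 1120}}
	perSecond, ok := Slope(samples)
	if !ok {
		fmt.Println("not enough data")
		return
	}
	fmt.Printf("%.0f certificates/day\n", perSecond*86400)
}
```

Multiplying the per-second slope by 86400 converts it to certificates per day, which tends to be a more useful unit for capacity planning.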

Database size growth:

pg_database_size_bytes{database="lamassu"}

Troubleshooting with Observability

Scenario: Slow Certificate Issuance

Check HTTP latency metrics:

histogram_quantile(0.95,
  rate(http_server_duration_bucket{http_route="/api/ca/v1/cas/{id}/certificates/sign"}[5m])
)

View distributed trace to identify bottleneck:
- Navigate to Tempo in Grafana
- Search for slow traces: { duration > 2s }
- Identify which span is taking the most time (DB query, crypto operation, etc.)

Check crypto engine metrics:

histogram_quantile(0.95,
  sum(rate(crypto_operation_duration_seconds_bucket[5m])) by (le, engine_id)
)

Review logs for errors:

{service="ca-api"} |= "sign" |= "error"

Scenario: Enrollment Failures

Check enrollment failure rate:

sum(rate(est_enrollment_failed_total[5m])) by (failure_reason)

Review authentication failures:

{service="dms-manager"} |= "authentication" |= "failed"

Trace a failed enrollment:

{ span.http.route = "/.well-known/est/{dms-id}/simpleenroll" 
  && status = error }

Troubleshooting - Common issues and solutions
Backup & Restore - Data protection
Security Best Practices - Security hardening
