Health Monitoring¶

Go Overlay continuously monitors the health of every managed service, detects failures, and takes the appropriate action to keep the system stable. Understanding how monitoring works is essential for configuring critical services and ensuring high availability.

Overview¶

Go Overlay's monitoring system tracks:

Process Status: whether the process is running
Exit Codes: the exit code reported when a process terminates
Service States: transitions between states (RUNNING, FAILED, etc.)
Critical Service Failures: failures in services marked as required

Failure Detection¶

Process Monitoring¶

Go Overlay monitors each service through the PID (process ID) of its process.

A failure is detected when:

- The process terminates unexpectedly
- The process exits with a non-zero exit code
- The process cannot be started
- The process crashes

Check frequency:

- Monitoring is continuous, via signal handling
- Termination is detected immediately when the process exits
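The detection described above can be approximated with Go's standard os/exec package. This is a minimal sketch, not Go Overlay's actual implementation: `runAndWait` is a hypothetical helper, and it blocks on `Wait` rather than handling signals directly.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runAndWait (hypothetical helper) starts a command and blocks until it exits.
// It returns an error if the process could not be started. Otherwise it
// returns the exit code, which is -1 if the process was killed by a signal.
func runAndWait(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return -1, fmt.Errorf("start failed: %w", err)
	}

	err := cmd.Wait() // returns as soon as the child terminates
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err
	}
	return 0, nil
}

func main() {
	code, err := runAndWait("sh", "-c", "exit 3")
	fmt.Println("exit code:", code, "error:", err)
}
```

Because `Wait` returns the moment the child exits, no polling interval is involved, which matches the "immediate detection" behavior described above.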

Exit Code Interpretation¶

When a service terminates, Go Overlay interprets its exit code:

Exit Code	Meaning	Action
0	Normal/graceful termination	Marked as STOPPED
137 (128 + 9, SIGKILL)	Forced termination	Marked as STOPPED
143 (128 + 15, SIGTERM)	Graceful termination	Marked as STOPPED
Any other code in 1-255	Error/failure	Marked as FAILED

Example log output:

Service 'api-server' exited with code 1
Service 'api-server' state changed: RUNNING -> FAILED
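The mapping in the table above can be expressed as a small function. This is an illustrative sketch only; `stateForExitCode` is a hypothetical name and not part of Go Overlay's API.

```go
package main

import "fmt"

// stateForExitCode (hypothetical) mirrors the exit code table:
// 0, 137 (128+SIGKILL) and 143 (128+SIGTERM) map to STOPPED,
// every other code maps to FAILED.
func stateForExitCode(code int) string {
	switch code {
	case 0, 137, 143:
		return "STOPPED"
	default:
		return "FAILED"
	}
}

func main() {
	for _, code := range []int{0, 1, 137, 143} {
		fmt.Printf("exit code %d -> %s\n", code, stateForExitCode(code))
	}
}
```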

Critical Services (required flag)¶

The required field marks a service as critical to the system. If a critical service fails, Go Overlay initiates a complete system shutdown.

Configuration¶

[[services]]
name = "database"
command = "/usr/bin/mysqld"
required = true  # System shuts down if this service fails

Behavior¶

When a required service fails:

The service enters the FAILED state
An error is logged
Go Overlay initiates shutdown of all services
The container/system exits

Example scenario:

[[services]]
name = "postgres"
command = "/usr/bin/postgres"
required = true

[[services]]
name = "redis"
command = "/usr/bin/redis-server"
required = false

[[services]]
name = "api"
command = "/app/api"
depends_on = ["postgres", "redis"]
required = true

Failure scenarios:

Service Failed	Required?	Result
postgres	Yes	System shutdown initiated
redis	No	Only redis stops, others continue
api	Yes	System shutdown initiated

Log output when required service fails:

ERROR: Critical service 'postgres' failed with exit code 1
Initiating system shutdown due to critical service failure
Stopping all services...
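The decision in the failure scenarios table reduces to a lookup on the required flag. The sketch below assumes a simplified `Service` struct and a hypothetical `shouldShutdown` function; neither is taken from Go Overlay's source.

```go
package main

import "fmt"

// Service holds the two config fields that matter for this decision.
type Service struct {
	Name     string
	Required bool
}

// shouldShutdown (hypothetical) reports whether the failure of the named
// service should bring the whole system down.
func shouldShutdown(services []Service, failed string) bool {
	for _, s := range services {
		if s.Name == failed {
			return s.Required
		}
	}
	return false // unknown service: nothing to decide
}

func main() {
	services := []Service{
		{Name: "postgres", Required: true},
		{Name: "redis", Required: false},
		{Name: "api", Required: true},
	}
	for _, s := range services {
		fmt.Printf("%s fails -> shutdown=%v\n", s.Name, shouldShutdown(services, s.Name))
	}
}
```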

When to Use required = true¶

Use required = true for services that are absolutely essential.

✓ Good candidates:

- Primary database
- Authentication service
- Main application API
- Services the system cannot function without

✗ Avoid for:

- Logging services
- Metrics services
- Optional services
- Workers that can be restarted

Example configuration:

# Critical services
[[services]]
name = "database"
command = "/usr/bin/mysqld"
required = true  # ✓ System cannot function without the DB

[[services]]
name = "api-server"
command = "/app/api"
depends_on = ["database"]
required = true  # ✓ The API is the main service

# Non-critical services
[[services]]
name = "metrics-exporter"
command = "/app/metrics"
required = false  # ✗ Metrics are useful but not critical

[[services]]
name = "background-worker"
command = "/app/worker"
required = false  # ✗ Workers can fail and be restarted

System Shutdown on Critical Failure¶

When a critical service fails, Go Overlay performs an orderly shutdown:

Shutdown Sequence¶

flowchart TD
    A[Critical Service Fails] --> B[Log Critical Failure]
    B --> C[Mark Service as FAILED]
    C --> D[Initiate System Shutdown]
    D --> E[Stop All Running Services]
    E --> F[Execute Pos-Scripts]
    F --> G[Cleanup Resources]
    G --> H[Exit Supervisor]
    H --> I[Container Stops]

Detailed Steps¶

Failure Detection

Service 'database' process terminated with exit code 1

Critical Failure Logged

CRITICAL: Required service 'database' has failed
System cannot continue without this service

Shutdown Initiated

Initiating graceful shutdown due to critical service failure

Services Stopped

Stopping service: api-server
Stopping service: redis
Stopping service: nginx

Cleanup

Executing pos-scripts...
Releasing resources...

Exit
```
Supervisor exiting with code 1
```
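The sequence above can be modeled as an ordered list of steps. The sketch below is an assumption about one reasonable design, not Go Overlay's implementation: a failing step is recorded but later steps still run, so cleanup is not skipped. The `step` type, `runShutdown`, and `errScript` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errScript is a stand-in error used to simulate a failing pos-script.
var errScript = errors.New("script exited with a non-zero code")

// step is one stage of the shutdown sequence.
type step struct {
	name string
	run  func() error
}

// runShutdown executes every step in order. A failing step is logged and
// recorded, but the remaining steps still run. It returns the names of the
// steps that failed.
func runShutdown(steps []step) []string {
	var failed []string
	for _, s := range steps {
		fmt.Println("shutdown:", s.name)
		if err := s.run(); err != nil {
			fmt.Printf("shutdown: %s failed: %v\n", s.name, err)
			failed = append(failed, s.name)
		}
	}
	return failed
}

func main() {
	ok := func() error { return nil }
	steps := []step{
		{"stop running services", ok},
		{"execute pos-scripts", func() error { return errScript }},
		{"cleanup resources", ok},
	}
	fmt.Println("failed steps:", runShutdown(steps))
}
```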

Monitoring Best Practices¶

1. Identify Critical Services¶

Review your architecture and identify the services that are truly critical:

# Example: E-commerce application

[[services]]
name = "postgres"
command = "/usr/bin/postgres"
required = true  # ✓ Nothing works without the DB

[[services]]
name = "redis-cache"
command = "/usr/bin/redis-server"
required = false  # ✗ The cache can fail; the app keeps running (more slowly)

[[services]]
name = "payment-api"
command = "/app/payment"
depends_on = ["postgres"]
required = true  # ✓ Payments are critical

[[services]]
name = "recommendation-engine"
command = "/app/recommendations"
required = false  # ✗ Recommendations are nice-to-have

2. Implement Health Checks¶

Add health checks to your services to detect problems before they become complete failures:

// Example: HTTP health check endpoint
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Check database connection
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "database connection failed",
        })
        return
    }

    // Check Redis connection
    if err := redis.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "redis connection failed",
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{
        "status": "healthy",
    })
}

3. Log Appropriately¶

Implement detailed logging to make diagnosis easier:

// Log startup
log.Println("Service starting...")

// Log health status
log.Println("Database connection: OK")
log.Println("Redis connection: OK")

// Log errors
log.Printf("ERROR: Failed to connect to database: %v", err)

// Log shutdown
log.Println("Shutting down gracefully...")

4. Handle Errors Gracefully¶

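Implement retry logic and fallbacks where appropriate:

// Retry the database connection with a linearly increasing delay.
// sql.Open only validates its arguments; Ping is what actually connects.
// dsn is assumed to be defined elsewhere in the service, and a Postgres
// driver must be imported so that the "postgres" driver name is registered.
func connectWithRetry(maxRetries int) (*sql.DB, error) {
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }

    for i := 0; i < maxRetries; i++ {
        if err = db.Ping(); err == nil {
            return db, nil
        }

        log.Printf("Connection attempt %d failed: %v", i+1, err)
        time.Sleep(time.Second * time.Duration(i+1))
    }

    db.Close()
    return nil, fmt.Errorf("failed to connect after %d attempts: %w", maxRetries, err)
}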

5. Monitor Service Status¶

Use the CLI to monitor service status:

# Check all services
go-overlay status

# Watch for changes
watch -n 2 go-overlay status

6. Set Up External Monitoring¶

Set up external monitoring to detect problems:

# Example: Docker healthcheck
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

7. Use Pos-Scripts for Cleanup¶

Configure pos-scripts to run cleanup when services fail:

[[services]]
name = "database"
command = "/usr/bin/mysqld"
required = true
pos_script = "/scripts/db-backup.sh"  # Back up before shutting down

Monitoring Patterns¶

Pattern 1: Database with Health Check¶

[[services]]
name = "postgres"
command = "/usr/bin/postgres"
args = ["-D", "/var/lib/postgresql/data"]
user = "postgres"
required = true
pre_script = "/scripts/init-db.sh"
pos_script = "/scripts/backup-db.sh"

#!/bin/bash
# init-db.sh - Verify database is ready
until pg_isready -U postgres; do
  echo "Waiting for postgres..."
  sleep 2
done
echo "PostgreSQL is ready"

Pattern 2: API with Dependency Monitoring¶

[[services]]
name = "api"
command = "/app/api"
depends_on = ["postgres", "redis"]
wait_after = "5s"
required = true

// API service with health monitoring
func main() {
    // Check dependencies on startup
    if err := checkDependencies(); err != nil {
        log.Fatalf("Dependency check failed: %v", err)
    }

    // Start health check endpoint
    go startHealthCheck()

    // Start main service
    startServer()
}

func checkDependencies() error {
    if err := db.Ping(); err != nil {
        return fmt.Errorf("database not available: %w", err)
    }

    if err := redis.Ping(); err != nil {
        return fmt.Errorf("redis not available: %w", err)
    }

    return nil
}

Pattern 3: Worker with Graceful Degradation¶

[[services]]
name = "worker"
command = "/app/worker"
depends_on = ["rabbitmq"]
required = false  # Worker can fail without stopping system

// Worker with error handling
func processJobs() {
    for {
        job, err := queue.Dequeue()
        if err != nil {
            log.Printf("Failed to dequeue: %v", err)
            time.Sleep(5 * time.Second)
            continue
        }

        if err := processJob(job); err != nil {
            log.Printf("Job processing failed: %v", err)
            // Requeue or dead letter
            queue.Requeue(job)
        }
    }
}

Troubleshooting¶

Service Keeps Failing¶

Symptoms:

- The service enters FAILED repeatedly
- The container restarts constantly

Diagnosis:

# Check logs
docker logs <container-id>

# Check service status
go-overlay status

# Check exit codes
docker inspect <container-id> | grep ExitCode

Solutions:

Check the service configuration
Run the command manually
Check dependencies
Review the error logs
Check resources (memory, CPU)

System Shuts Down Unexpectedly¶

Symptoms:

- The container exits without warning
- All services stop

Diagnosis:

# Check for critical service failures
docker logs <container-id> | grep CRITICAL

# Check which service failed
docker logs <container-id> | grep FAILED

Solutions:

Identify which required service failed
Fix the problem in that service
Consider whether the service really needs to be required
Implement retry logic in the service

False Positive Failures¶

Symptoms:

- The service is marked as FAILED but is working
- Exit codes are incorrect

Solutions:

Check that the service returns exit code 0 on graceful shutdown
Implement appropriate signal handlers
Review the service's termination logic

Monitoring Metrics¶

Key Metrics to Track¶

Service Uptime
- Time each service spends in RUNNING
- Failure frequency
Restart Count
- How many times services were restarted
- Failure patterns
Startup Time
- Time from PENDING to RUNNING
- Dependency resolution time
Shutdown Time
- Time from STOPPING to STOPPED
- Timeout frequency
Failure Rate
- Percentage of failures vs. successes
- Most problematic services

Example Monitoring Script¶

#!/bin/bash
# monitor-services.sh

while true; do
    STATUS=$(go-overlay status)

    # Count services by state
    RUNNING=$(echo "$STATUS" | grep -c "RUNNING")
    FAILED=$(echo "$STATUS" | grep -c "FAILED")

    # Log metrics
    echo "$(date): RUNNING=$RUNNING FAILED=$FAILED"

    # Alert if failures detected
    if [ "$FAILED" -gt 0 ]; then
        echo "ALERT: $FAILED service(s) in FAILED state"
    fi

    sleep 30
done
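The counting logic of the script can also be written in Go, where it is easier to unit test. This sketch assumes, as the script does, that each line of `go-overlay status` output contains the state name; the sample text in `main` is hypothetical and the real output format may differ. Unlike `grep -c`, a line is counted for at most one state.

```go
package main

import (
	"fmt"
	"strings"
)

// countStates counts the lines of status output that mention RUNNING
// or FAILED. A line mentioning FAILED is counted only as failed.
func countStates(status string) (running, failed int) {
	for _, line := range strings.Split(status, "\n") {
		switch {
		case strings.Contains(line, "FAILED"):
			failed++
		case strings.Contains(line, "RUNNING"):
			running++
		}
	}
	return running, failed
}

func main() {
	// Hypothetical sample; not the documented output of go-overlay.
	sample := "postgres  RUNNING\nredis  FAILED\napi  RUNNING\n"
	running, failed := countStates(sample)
	fmt.Printf("RUNNING=%d FAILED=%d\n", running, failed)
	if failed > 0 {
		fmt.Printf("ALERT: %d service(s) in FAILED state\n", failed)
	}
}
```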

Related Topics¶

Service Lifecycle - Service states and transitions
Graceful Shutdown - The shutdown process
Services Configuration - Configuring the required field
Dependency Management - How dependencies affect health
