Graceful Shutdown¶

Graceful shutdown is a critical aspect of service management in containers. Go Overlay implements a robust shutdown process that ensures all services have a chance to shut down gracefully, save state, and release resources before forced termination.

Overview¶

When Go Overlay receives a termination signal (typically SIGTERM in Docker environments), it initiates a coordinated shutdown sequence that:

1. Stops accepting new services
2. Sends termination signals to all running services
3. Waits for graceful termination within configured timeouts
4. Forces termination of unresponsive services
5. Executes cleanup scripts (pos-scripts)
6. Terminates the supervisor

Signal Handling¶

SIGTERM (Signal 15)¶

The primary signal used for graceful shutdown.

Behavior when Go Overlay receives SIGTERM:

- Initiates an ordered shutdown sequence
- Propagates SIGTERM to all managed services
- Respects configured timeouts
- Allows services to save state and clean up resources

Common usage:

# Docker sends SIGTERM when executing stop
docker stop <container-id>

# Kubernetes sends SIGTERM during pod termination
kubectl delete pod <pod-name>

# Manual
kill -TERM <supervisor-pid>

SIGKILL (Signal 9)¶

A forced termination signal used as a last resort. Cannot be caught or ignored.

Behavior:

- Immediate process termination
- No opportunity for cleanup
- Automatically triggered after timeout expires
- Not recommended for manual use

When used:

- Automatically by Go Overlay after service_shutdown_timeout expires
- By Docker after the docker stop timeout (default 10s)
- Manually only in emergencies

Avoid SIGKILL

SIGKILL should be avoided whenever possible, as it does not allow adequate cleanup and can result in:

- Lost unsaved data
- Unclosed connections
- Unreleased locks
- Inconsistent state
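
The difference between the two signals can be observed from inside a process on a POSIX system: a handler can be installed for SIGTERM, while the kernel refuses one for SIGKILL. The sketch below is only an illustration; the helper name can_catch is made up here and is not part of Go Overlay.

```python
import signal


def can_catch(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # OSError: the kernel rejects handlers for SIGKILL (and SIGSTOP).
        # ValueError: invalid signal number, or not called from the main thread.
        return False
    # Restore whatever was installed before so the probe has no lasting effect.
    if previous is not None:
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    print("SIGTERM catchable:", can_catch(signal.SIGTERM))
    print("SIGKILL catchable:", can_catch(signal.SIGKILL))
```

This is why a service can only prepare for SIGTERM: once SIGKILL arrives there is no code path left in which to clean up.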

Shutdown Sequence¶

Go Overlay follows a specific sequence during shutdown to ensure orderly termination:

flowchart TD
    A[SIGTERM Received] --> B[Stop Accepting New Services]
    B --> C[Identify Running Services]
    C --> D[Send SIGTERM to All Services]
    D --> E{All Services\nStopped?}
    E -->|Yes| F[Execute Pos-Scripts]
    E -->|No| G{Timeout\nExpired?}
    G -->|No| E
    G -->|Yes| H[Send SIGKILL to\nRemaining Services]
    H --> I[Force Terminate]
    I --> F
    F --> J[Cleanup Resources]
    J --> K[Exit Supervisor]
    K --> L[Container Stops]

Detailed Steps¶

1. Signal Reception¶

Go Overlay receives SIGTERM
↓
Logs: "Received shutdown signal, initiating graceful shutdown..."

2. Service Enumeration¶

Identify all services in RUNNING state
↓
Order services by reverse dependency (dependents stop first)
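
Reverse dependency ordering can be sketched with a topological sort. The example below is an illustration only, not Go Overlay's implementation; the dependency map and the service names in it are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def stop_order(depends_on):
    """Return service names in shutdown order: dependents before their dependencies.

    depends_on maps each service to the set of services it requires.
    """
    # static_order() yields dependencies first, which is the startup order.
    start_order = list(TopologicalSorter(depends_on).static_order())
    # Shutdown walks the same order backwards.
    return list(reversed(start_order))


# Hypothetical dependency graph: nginx needs app, app needs db.
services = {"db": set(), "app": {"db"}, "nginx": {"app"}}
print(stop_order(services))  # ['nginx', 'app', 'db']
```

Stopping nginx before app, and app before db, means no service loses a dependency while it is still handling work.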

3. SIGTERM Propagation¶

For each service:
  - Change state to STOPPING
  - Send SIGTERM to service process
  - Start shutdown timer (service_shutdown_timeout)
  - Log: "Stopping service: <name>"

4. Graceful Wait Period¶

While services are stopping:
  - Monitor process status
  - Check if processes have exited
  - Respect service_shutdown_timeout
  - Log progress

5. Timeout Handling¶

If service_shutdown_timeout expires:
  - Log: "Service <name> did not stop gracefully, forcing termination"
  - Send SIGKILL to service process
  - Wait briefly for forced termination
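
Steps 3 to 5 amount to "SIGTERM, wait, then SIGKILL". A minimal sketch of that pattern for a single child process follows, assuming a POSIX system; the function name and return values are illustrative and do not reflect Go Overlay's internals.

```python
import subprocess
import sys


def stop_with_timeout(proc, timeout_s):
    """Send SIGTERM, wait up to timeout_s seconds, then fall back to SIGKILL.

    Returns "graceful" if the process exited in time, "forced" otherwise.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "forced"


if __name__ == "__main__":
    # Stand-in for a managed service: a child that sleeps for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_timeout(child, timeout_s=5.0))
```

The timeout_s argument plays the role of service_shutdown_timeout: a service that handles SIGTERM promptly never reaches the SIGKILL branch.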

6. Post-Script Execution¶

For each stopped service with pos_script:
  - Execute pos_script
  - Wait for completion (with timeout)
  - Log results
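
Running a cleanup command under a time limit can be sketched as below. This is an assumption-laden illustration of "wait for completion (with timeout)"; the function name and the status strings are invented for the example.

```python
import subprocess
import sys


def run_pos_script(argv, timeout_s):
    """Run a cleanup command and report how it ended.

    Returns "ok", "failed (exit N)", or "timed out".
    """
    try:
        # subprocess.run kills the child itself when the timeout expires.
        result = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timed out"
    if result.returncode != 0:
        return f"failed (exit {result.returncode})"
    return "ok"


if __name__ == "__main__":
    # Stand-in for a real script such as a backup or log-rotation step.
    print(run_pos_script([sys.executable, "-c", "print('cleanup done')"], timeout_s=10.0))
```

A script that exceeds its limit is cut off, so long-running work such as a full backup needs a correspondingly larger post_script_timeout.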

7. Final Cleanup¶

- Close IPC connections
- Release file handles
- Log final status
- Exit with code 0

Timeout Configuration¶

Go Overlay uses three timeouts to control the shutdown process:

service_shutdown_timeout¶

Maximum time to wait for each individual service to shut down gracefully.

[timeouts]
service_shutdown_timeout = "30s"  # Default: 30 seconds

Recommendations:

- Web applications: 15-30s
- Databases: 30-60s
- Background workers: 30-45s
- Stateless services: 10-15s

global_shutdown_timeout¶

Maximum total time for the entire shutdown process.

[timeouts]
global_shutdown_timeout = "120s"  # Default: 120 seconds

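Usage:

- Ensures the supervisor does not hang indefinitely
- Should be greater than the sum of all service_shutdown_timeout values
- Should also leave room for the time pos-scripts take
- Must fit within the container runtime's own grace period (docker stop waits 10s by default before sending SIGKILL), or the supervisor is killed before it finishes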

post_script_timeout¶

Maximum time allowed for pos-script execution.

[timeouts]
post_script_timeout = "10s"  # Default: 10 seconds
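
How the three timeouts relate to each other can be checked with simple arithmetic. The sketch below assumes the worst case in which services and pos-scripts are handled one after another, and it also compares the result with the container runtime's grace period; all function names are hypothetical.

```python
def worst_case_shutdown(service_timeouts, post_script_timeout, pos_script_count):
    """Upper bound in seconds, assuming everything runs one after another."""
    return sum(service_timeouts) + pos_script_count * post_script_timeout


def check_budget(service_timeouts, post_script_timeout, pos_script_count,
                 global_timeout, runtime_grace):
    """Return a list of problems; an empty list means the budget is consistent."""
    problems = []
    worst = worst_case_shutdown(service_timeouts, post_script_timeout, pos_script_count)
    if global_timeout <= worst:
        problems.append(
            f"global_shutdown_timeout ({global_timeout}s) should exceed worst case ({worst}s)"
        )
    if runtime_grace < global_timeout:
        problems.append(
            f"runtime grace period ({runtime_grace}s) is shorter than "
            f"global_shutdown_timeout ({global_timeout}s)"
        )
    return problems


# Three services at the default 30s, one pos-script at 10s, global default 120s,
# and the default docker stop grace period of 10s.
for problem in check_budget([30, 30, 30], 10, 1, global_timeout=120, runtime_grace=10):
    print(problem)
```

With these numbers the Go Overlay timeouts are consistent (100s worst case against a 120s limit), but a plain docker stop would send SIGKILL after 10s, so the container would need to be stopped with a longer grace period.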

Implementing Graceful Shutdown in Services¶

To benefit from graceful shutdown, your services should implement appropriate signal handlers.

Go Example¶

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}

    // Setup signal handling
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)

    // Start server in goroutine
    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    log.Println("Server started on :8080")

    // Wait for signal
    <-sigChan
    log.Println("Shutdown signal received, stopping gracefully...")

    // Create shutdown context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Attempt graceful shutdown
    if err := server.Shutdown(ctx); err != nil {
        log.Printf("Error during shutdown: %v", err)
    }

    log.Println("Server stopped")
}

Python Example¶

import signal
import sys
import time

class GracefulShutdown:
    def __init__(self):
        self.shutdown_requested = False
        signal.signal(signal.SIGTERM, self.handle_signal)
        signal.signal(signal.SIGINT, self.handle_signal)

    def handle_signal(self, signum, frame):
        print(f"Received signal {signum}, initiating graceful shutdown...")
        self.shutdown_requested = True

    def cleanup(self):
        print("Performing cleanup...")
        # Close database connections
        # Flush buffers
        # Save state
        time.sleep(2)  # Simulate cleanup
        print("Cleanup complete")

def main():
    shutdown_handler = GracefulShutdown()

    print("Service started")

    # Main service loop
    while not shutdown_handler.shutdown_requested:
        # Do work
        time.sleep(1)

    # Cleanup before exit
    shutdown_handler.cleanup()
    print("Service stopped")
    sys.exit(0)

if __name__ == "__main__":
    main()

Node.js Example¶

const http = require('http');

const server = http.createServer((req, res) => {
    res.writeHead(200);
    res.end('Hello World\n');
});

server.listen(3000, () => {
    console.log('Server started on port 3000');
});

// Graceful shutdown handler
function gracefulShutdown(signal) {
    console.log(`Received ${signal}, starting graceful shutdown...`);

    server.close(() => {
        console.log('HTTP server closed');

        // Close database connections
        // Flush logs
        // Save state

        console.log('Cleanup complete, exiting');
        process.exit(0);
    });

    // Force exit after timeout
    setTimeout(() => {
        console.error('Forced shutdown after timeout');
        process.exit(1);
    }, 30000);
}

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

Testing Graceful Shutdown¶

It is essential to test the graceful shutdown behavior of your services before putting them into production.

Local Testing¶

Test 1: Normal Graceful Shutdown¶

# Start container
docker run -d --name test-supervisor your-image

# Wait for services to start
sleep 5

# Send SIGTERM (graceful)
docker stop test-supervisor

# Check logs
docker logs test-supervisor

Expected output:

Received shutdown signal, initiating graceful shutdown...
Stopping service: nginx
Stopping service: app
Service nginx stopped gracefully
Service app stopped gracefully
All services stopped, exiting

Test 2: Timeout Behavior¶

Configure a short timeout and a service that does not respond:

[timeouts]
service_shutdown_timeout = "5s"

[[services]]
name = "slow-service"
command = "/usr/bin/slow-app"  # App that ignores SIGTERM

docker stop test-supervisor

# Observe logs
docker logs test-supervisor

Expected output:

Stopping service: slow-service
Service slow-service did not stop gracefully, forcing termination
Service slow-service terminated with SIGKILL

Test 3: Pos-Script Execution¶

[[services]]
name = "database"
command = "/usr/bin/mysqld"
pos_script = "/scripts/backup-db.sh"

docker stop test-supervisor
docker logs test-supervisor

Expected output:

Stopping service: database
Service database stopped gracefully
Executing pos-script for database: /scripts/backup-db.sh
Pos-script completed successfully

Automated Testing¶

Create automated tests to validate shutdown behavior:

#!/bin/bash
# test-shutdown.sh

set -euo pipefail

echo "Testing graceful shutdown..."

# Start container
CONTAINER_ID=$(docker run -d your-image)

# Always remove the container, even when the test fails
trap 'docker rm -f "$CONTAINER_ID" >/dev/null' EXIT

# Wait for startup
sleep 5

# Send stop signal (Docker waits up to 60s before sending SIGKILL)
docker stop --time=60 "$CONTAINER_ID"

# Check exit code
EXIT_CODE=$(docker inspect "$CONTAINER_ID" --format='{{.State.ExitCode}}')

if [ "$EXIT_CODE" -eq 0 ]; then
    echo "✓ Graceful shutdown successful"
else
    echo "✗ Shutdown failed with exit code $EXIT_CODE"
    exit 1
fi
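
An exit code of 0 does not show that the shutdown steps ran in the expected order. A test could additionally check the captured log text, as in the sketch below. The helper name is hypothetical, and the messages are taken from the expected output shown earlier on this page, so they may need adjusting to the real log wording.

```python
def lines_in_order(log_text, expected):
    """Return True if every expected message appears in log_text, in the given order.

    Other lines may be interleaved; each message is matched as a substring.
    """
    remaining = iter(log_text.splitlines())
    # any(...) consumes the iterator up to the first match, so the next
    # message is only searched for in the lines that follow it.
    return all(any(message in line for line in remaining) for message in expected)


sample_log = "\n".join([
    "Received shutdown signal, initiating graceful shutdown...",
    "Stopping service: nginx",
    "Stopping service: app",
    "Service nginx stopped gracefully",
    "Service app stopped gracefully",
    "All services stopped, exiting",
])

print(lines_in_order(sample_log, [
    "Received shutdown signal",
    "Stopping service: app",
    "All services stopped",
]))
```

In a real test the text would come from docker logs for the stopped container; note that docker logs can write to both stdout and stderr, so both streams should be captured.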

Best Practices¶

1. Set Appropriate Timeouts¶

[timeouts]
# Allow enough time for cleanup
service_shutdown_timeout = "30s"

# Global timeout should be sum of all service timeouts + buffer
global_shutdown_timeout = "120s"

# Pos-scripts should be quick
post_script_timeout = "10s"

2. Implement Signal Handlers¶

Always implement handlers for SIGTERM in your services:

signal.Notify(sigChan, syscall.SIGTERM)

3. Use Pos-Scripts for Cleanup¶

[[services]]
name = "app"
command = "/usr/bin/app"
pos_script = "/scripts/cleanup.sh"  # Backup, logs, etc.

4. Test Regularly¶

Include shutdown tests in your CI/CD pipeline.

5. Monitor Shutdown Time¶

# Measure shutdown time
time docker stop <container-id>

If shutdown consistently reaches the timeout, investigate whether:

- Services are not responding to SIGTERM
- Timeouts are too short
- Cleanup is too slow

6. Log Shutdown Progress¶

Add detailed logging to your services:

log.Println("Shutdown initiated")
log.Println("Closing database connections...")
log.Println("Flushing buffers...")
log.Println("Shutdown complete")

Troubleshooting¶

Services Not Stopping Gracefully¶

Symptoms:

- Timeout always expires
- SIGKILL is always necessary

Solutions:

- Verify that the service implements a SIGTERM handler
- Increase service_shutdown_timeout
- Test the service manually with kill -TERM

Container Takes Too Long to Stop¶

Symptoms:

- docker stop takes too long
- Docker timeout is reached

Solutions:

- Reduce service_shutdown_timeout if appropriate
- Optimize service cleanup
- Use docker stop --time=<seconds> with a larger value

Pos-Scripts Not Executing¶

Symptoms:

- Cleanup scripts do not execute
- Logs do not show pos-script execution

Solutions:

- Check script permissions
- Test script manually
- Verify post_script_timeout
- Confirm path is correct

Related Topics¶

- Service Lifecycle - States and transitions of services
- Timeouts Configuration - Detailed timeout configuration
- Health Monitoring - Health monitoring
