Graceful Shutdown

Graceful shutdown is a critical aspect of service management in containers. Go Overlay implements a robust shutdown process that ensures all services have a chance to shut down gracefully, save state, and release resources before forced termination.

Overview

When Go Overlay receives a termination signal (typically SIGTERM in Docker environments), it initiates a coordinated shutdown sequence that:

  1. Stops accepting new services
  2. Sends termination signals to all running services
  3. Waits for graceful termination within configured timeouts
  4. Forces termination of unresponsive services
  5. Executes cleanup scripts (pos-scripts)
  6. Terminates the supervisor

Signal Handling

SIGTERM (Signal 15)

SIGTERM is the primary signal used for graceful shutdown.

Behavior when Go Overlay receives SIGTERM:
  - Initiates an ordered shutdown sequence
  - Propagates SIGTERM to all managed services
  - Respects configured timeouts
  - Allows services to save state and clean up resources

Common usage:

# Docker sends SIGTERM when executing stop
docker stop <container-id>

# Kubernetes sends SIGTERM during pod termination
kubectl delete pod <pod-name>

# Manual
kill -TERM <supervisor-pid>

SIGKILL (Signal 9)

SIGKILL is a forced termination signal used as a last resort. It cannot be caught or ignored.

Behavior:
  - Immediate process termination
  - No opportunity for cleanup
  - Automatically triggered after timeout expires
  - Not recommended for manual use

When used:
  - Automatically by Go Overlay after service_shutdown_timeout expires
  - By Docker after the docker stop timeout (default 10s)
  - Manually only in emergencies

Avoid SIGKILL

SIGKILL should be avoided whenever possible, as it does not allow adequate cleanup and can result in:
  - Lost unsaved data
  - Unclosed connections
  - Unreleased locks
  - Inconsistent state

Shutdown Sequence

The Go Overlay follows a specific sequence during shutdown to ensure orderly termination:

flowchart TD
    A[SIGTERM Received] --> B[Stop Accepting New Services]
    B --> C[Identify Running Services]
    C --> D[Send SIGTERM to All Services]
    D --> E{All Services\nStopped?}
    E -->|Yes| F[Execute Pos-Scripts]
    E -->|No| G{Timeout\nExpired?}
    G -->|No| E
    G -->|Yes| H[Send SIGKILL to\nRemaining Services]
    H --> I[Force Terminate]
    I --> F
    F --> J[Cleanup Resources]
    J --> K[Exit Supervisor]
    K --> L[Container Stops]

Detailed Steps

1. Signal Reception

Go Overlay receives SIGTERM
Logs: "Received shutdown signal, initiating graceful shutdown..."

2. Service Enumeration

Identify all services in RUNNING state
Order services by reverse dependency (dependents stop first)
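
The reverse dependency ordering can be sketched as a depth-first topological sort whose result is reversed. This is a minimal illustration, not Go Overlay's actual implementation: the function name stopOrder, the map-based dependency format, and the example services are all assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// stopOrder returns service names ordered so that every service appears
// before the services it depends on (dependents stop first).
// deps maps a service name to the names it depends on (hypothetical format).
func stopOrder(deps map[string][]string) ([]string, error) {
	// Sort names so the result is deterministic for a given input.
	names := make([]string, 0, len(deps))
	for name := range deps {
		names = append(names, name)
	}
	sort.Strings(names)

	const (
		unvisited = iota
		visiting
		done
	)
	state := make(map[string]int, len(deps))
	var order []string // built in start order: dependencies first

	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case done:
			return nil
		case visiting:
			return fmt.Errorf("dependency cycle involving %q", name)
		}
		state[name] = visiting
		for _, dep := range deps[name] {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}

	for _, name := range names {
		if err := visit(name); err != nil {
			return nil, err
		}
	}

	// Reverse the start order to obtain the stop order.
	for i, j := 0, len(order)-1; i < j; i, j = i+1, j-1 {
		order[i], order[j] = order[j], order[i]
	}
	return order, nil
}

func main() {
	// Hypothetical setup: nginx depends on app, app depends on database.
	deps := map[string][]string{
		"nginx":    {"app"},
		"app":      {"database"},
		"database": nil,
	}
	order, err := stopOrder(deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order) // prints [nginx app database]
}
```

Detecting cycles up front matters here: a cyclic dependency has no valid stop order, so the sketch returns an error instead of guessing.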

3. SIGTERM Propagation

For each service:
  - Change state to STOPPING
  - Send SIGTERM to service process
  - Start shutdown timer (service_shutdown_timeout)
  - Log: "Stopping service: <name>"

4. Graceful Wait Period

While services are stopping:
  - Monitor process status
  - Check if processes have exited
  - Respect service_shutdown_timeout
  - Log progress

5. Timeout Handling

If service_shutdown_timeout expires:
  - Log: "Service <name> did not stop gracefully, forcing termination"
  - Send SIGKILL to service process
  - Wait briefly for forced termination
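
Steps 3 to 5 amount to "send SIGTERM, wait up to a timeout, then send SIGKILL". The sketch below shows that pattern for a single child process using Go's os/exec package. It is an illustration under assumptions, not Go Overlay's code: the names stopProcess and runDemo are hypothetical, and the demo assumes a Unix-like system with sh and sleep available.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// stopProcess sends SIGTERM to an already started command and waits up to
// timeout for it to exit. If the timeout expires, it sends SIGKILL.
// The returned bool reports whether SIGKILL was needed.
func stopProcess(cmd *exec.Cmd, timeout time.Duration) (bool, error) {
	// Reap the process in the background so its exit can race the timer.
	exited := make(chan struct{})
	go func() {
		_ = cmd.Wait() // exit status is ignored in this sketch
		close(exited)
	}()

	err := cmd.Process.Signal(syscall.SIGTERM)
	if err != nil && !errors.Is(err, os.ErrProcessDone) {
		return false, err
	}

	select {
	case <-exited:
		return false, nil // stopped within the timeout
	case <-time.After(timeout):
		// Kill sends SIGKILL on Unix. Its error is ignored because the
		// most likely cause is that the process has just exited.
		_ = cmd.Process.Kill()
		<-exited
		return true, nil
	}
}

// runDemo starts `sh -c script`, gives it a moment to set up, then stops it.
func runDemo(script string, timeout time.Duration) (bool, error) {
	cmd := exec.Command("sh", "-c", script)
	if err := cmd.Start(); err != nil {
		return false, err
	}
	time.Sleep(500 * time.Millisecond) // let the shell install any trap
	return stopProcess(cmd, timeout)
}

func main() {
	// First a process that exits on SIGTERM, then one that ignores it.
	for _, script := range []string{"sleep 5", "trap '' TERM; sleep 5"} {
		forced, err := runDemo(script, time.Second)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q forced=%v\n", script, forced)
	}
}
```

Waiting in a separate goroutine keeps the timer and the process exit independent, so a service that stops early never has to wait out the full timeout.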

6. Pos-Script Execution

For each stopped service with pos_script:
  - Execute pos_script
  - Wait for completion (with timeout)
  - Log results
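
Running a script "with timeout" can be sketched with exec.CommandContext, which kills the process once its context expires. This is only an illustration of the idea: runScript and errScriptTimeout are hypothetical names, and `sh -c` stands in for a real pos_script path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// errScriptTimeout is returned when the script outlives its timeout.
var errScriptTimeout = errors.New("pos-script timed out")

// runScript runs a command and kills it if it runs longer than timeout.
func runScript(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext kills the process once ctx is done.
	err := exec.CommandContext(ctx, name, args...).Run()
	if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return errScriptTimeout
	}
	return err
}

func main() {
	// "sh -c ..." stands in for a real script such as /scripts/cleanup.sh.
	fmt.Println("quick script:", runScript(2*time.Second, "sh", "-c", "exit 0"))
	fmt.Println("slow script:", runScript(200*time.Millisecond, "sh", "-c", "sleep 5"))
}
```

Returning a distinct timeout error lets the caller log a timed-out script differently from one that ran and exited with a non-zero status.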

7. Final Cleanup

- Close IPC connections
- Release file handles
- Log final status
- Exit with code 0

Timeout Configuration

The Go Overlay uses multiple timeouts to control the shutdown process:

service_shutdown_timeout

Maximum time to wait for each individual service to shut down gracefully.

[timeouts]
service_shutdown_timeout = "30s"  # Default: 30 seconds

Recommendations:
  - Web applications: 15-30s
  - Databases: 30-60s
  - Background workers: 30-45s
  - Stateless services: 10-15s

global_shutdown_timeout

Maximum total time for the entire shutdown process.

[timeouts]
global_shutdown_timeout = "120s"  # Default: 120 seconds

Usage:
  - Ensures the supervisor does not hang indefinitely
  - Should be greater than the sum of all service_shutdown_timeout values
  - Should also leave time for pos-scripts to run

post_script_timeout

Maximum time for pos-script execution.

[timeouts]
post_script_timeout = "10s"  # Default: 10 seconds
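
The relationship between the three timeouts can be checked with a little arithmetic. The sketch below assumes a worst case in which services stop one after another, every service has a pos-script, and each one uses its full timeout. That model, the function names, and parsing the values with time.ParseDuration are assumptions for illustration; Go Overlay may schedule shutdown differently.

```go
package main

import (
	"fmt"
	"time"
)

// requiredGlobalTimeout returns the worst-case shutdown time when services
// stop one after another and each uses its full service timeout and its
// full pos-script timeout (an assumed model, see the text above).
func requiredGlobalTimeout(serviceTimeout, postScriptTimeout string, services int) (time.Duration, error) {
	svc, err := time.ParseDuration(serviceTimeout)
	if err != nil {
		return 0, fmt.Errorf("service_shutdown_timeout: %w", err)
	}
	post, err := time.ParseDuration(postScriptTimeout)
	if err != nil {
		return 0, fmt.Errorf("post_script_timeout: %w", err)
	}
	return time.Duration(services) * (svc + post), nil
}

// checkGlobalTimeout returns an error if global is shorter than the worst case.
func checkGlobalTimeout(global, serviceTimeout, postScriptTimeout string, services int) error {
	g, err := time.ParseDuration(global)
	if err != nil {
		return fmt.Errorf("global_shutdown_timeout: %w", err)
	}
	need, err := requiredGlobalTimeout(serviceTimeout, postScriptTimeout, services)
	if err != nil {
		return err
	}
	if g < need {
		return fmt.Errorf("global_shutdown_timeout %s is shorter than worst case %s", g, need)
	}
	return nil
}

func main() {
	// Defaults from this page: 120s global, 30s per service, 10s per pos-script.
	for _, n := range []int{2, 4} {
		if err := checkGlobalTimeout("120s", "30s", "10s", n); err != nil {
			fmt.Printf("%d services: %v\n", n, err)
		} else {
			fmt.Printf("%d services: ok\n", n)
		}
	}
}
```

Under this model the defaults cover up to three services (3 × 40s = 120s); with more services, either raise global_shutdown_timeout or lower the per-service values.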

Implementing Graceful Shutdown in Services

To benefit from graceful shutdown, your services should implement appropriate signal handlers.

Go Example

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}

    // Setup signal handling
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)

    // Start server in goroutine
    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    log.Println("Server started on :8080")

    // Wait for signal
    <-sigChan
    log.Println("Shutdown signal received, stopping gracefully...")

    // Create shutdown context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Attempt graceful shutdown
    if err := server.Shutdown(ctx); err != nil {
        log.Printf("Error during shutdown: %v", err)
    }

    log.Println("Server stopped")
}
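
To see what server.Shutdown provides, the following sketch calls it while a request is still being handled and checks that the client nonetheless receives the full response. The helper name fetchDuringShutdown, the loopback listener, and the delays are assumptions chosen for the demonstration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// fetchDuringShutdown starts a server whose handler takes handlerDelay to
// respond, sends one request, and calls Shutdown while that request is
// still being handled. It returns the body the client received.
func fetchDuringShutdown(handlerDelay time.Duration) (string, error) {
	// Port 0 lets the OS pick a free port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}

	started := make(chan struct{})
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		close(started) // safe because only one request is ever sent
		time.Sleep(handlerDelay)
		fmt.Fprint(w, "done")
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln) // returns http.ErrServerClosed after Shutdown

	type result struct {
		body string
		err  error
	}
	resCh := make(chan result, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String() + "/")
		if err != nil {
			resCh <- result{err: err}
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		resCh <- result{body: string(body), err: err}
	}()

	// Wait until the handler is running, or bail out if the request failed.
	select {
	case <-started:
	case res := <-resCh:
		return res.body, res.err
	}

	// Shutdown stops accepting connections and waits for in-flight requests.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return "", err
	}
	res := <-resCh
	return res.body, res.err
}

func main() {
	body, err := fetchDuringShutdown(300 * time.Millisecond)
	fmt.Printf("body=%q err=%v\n", body, err)
}
```

A check like this fits well in a unit test: if a handler's cleanup ever starts taking longer than the shutdown context allows, Shutdown returns an error and the test fails.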

Python Example

import signal
import sys
import time

class GracefulShutdown:
    def __init__(self):
        self.shutdown_requested = False
        signal.signal(signal.SIGTERM, self.handle_signal)
        signal.signal(signal.SIGINT, self.handle_signal)

    def handle_signal(self, signum, frame):
        print(f"Received signal {signum}, initiating graceful shutdown...")
        self.shutdown_requested = True

    def cleanup(self):
        print("Performing cleanup...")
        # Close database connections
        # Flush buffers
        # Save state
        time.sleep(2)  # Simulate cleanup
        print("Cleanup complete")

def main():
    shutdown_handler = GracefulShutdown()

    print("Service started")

    # Main service loop
    while not shutdown_handler.shutdown_requested:
        # Do work
        time.sleep(1)

    # Cleanup before exit
    shutdown_handler.cleanup()
    print("Service stopped")
    sys.exit(0)

if __name__ == "__main__":
    main()

Node.js Example

const http = require('http');

const server = http.createServer((req, res) => {
    res.writeHead(200);
    res.end('Hello World\n');
});

server.listen(3000, () => {
    console.log('Server started on port 3000');
});

// Graceful shutdown handler
function gracefulShutdown(signal) {
    console.log(`Received ${signal}, starting graceful shutdown...`);

    server.close(() => {
        console.log('HTTP server closed');

        // Close database connections
        // Flush logs
        // Save state

        console.log('Cleanup complete, exiting');
        process.exit(0);
    });

    // Force exit after timeout
    setTimeout(() => {
        console.error('Forced shutdown after timeout');
        process.exit(1);
    }, 30000);
}

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

Testing Graceful Shutdown

It is essential to test the graceful shutdown behavior of your services before putting them into production.

Local Testing

Test 1: Normal Graceful Shutdown

# Start container
docker run -d --name test-supervisor your-image

# Wait for services to start
sleep 5

# Send SIGTERM (graceful)
docker stop test-supervisor

# Check logs
docker logs test-supervisor

Expected output:

Received shutdown signal, initiating graceful shutdown...
Stopping service: nginx
Stopping service: app
Service nginx stopped gracefully
Service app stopped gracefully
All services stopped, exiting

Test 2: Timeout Behavior

Configure a short timeout and a service that does not respond:

[timeouts]
service_shutdown_timeout = "5s"

[[services]]
name = "slow-service"
command = "/usr/bin/slow-app"  # App that ignores SIGTERM

# Stop the container (Docker sends SIGTERM)
docker stop test-supervisor

# Observe logs
docker logs test-supervisor

Expected output:

Stopping service: slow-service
Service slow-service did not stop gracefully, forcing termination
Service slow-service terminated with SIGKILL

Test 3: Pos-Script Execution

[[services]]
name = "database"
command = "/usr/bin/mysqld"
pos_script = "/scripts/backup-db.sh"

# Stop the container, then check the logs
docker stop test-supervisor
docker logs test-supervisor

Expected output:

Stopping service: database
Service database stopped gracefully
Executing pos-script for database: /scripts/backup-db.sh
Pos-script completed successfully

Automated Testing

Create automated tests to validate shutdown behavior:

#!/bin/bash
# test-shutdown.sh

set -e

echo "Testing graceful shutdown..."

# Start container
CONTAINER_ID=$(docker run -d your-image)

# Remove the container on exit, whether the test passes or fails
trap 'docker rm -f "$CONTAINER_ID" > /dev/null' EXIT

# Wait for startup
sleep 5

# Send stop signal
docker stop --time=60 "$CONTAINER_ID"

# Check exit code
EXIT_CODE=$(docker inspect "$CONTAINER_ID" --format='{{.State.ExitCode}}')

if [ "$EXIT_CODE" -eq 0 ]; then
    echo "✓ Graceful shutdown successful"
else
    echo "✗ Shutdown failed with exit code $EXIT_CODE"
    exit 1
fi

Best Practices

1. Set Appropriate Timeouts

[timeouts]
# Allow enough time for cleanup
service_shutdown_timeout = "30s"

# Global timeout should exceed the sum of all service timeouts, plus a buffer
global_shutdown_timeout = "120s"

# Pos-scripts should be quick
post_script_timeout = "10s"

2. Implement Signal Handlers

Always implement handlers for SIGTERM in your services:

signal.Notify(sigChan, syscall.SIGTERM)

3. Use Pos-Scripts for Cleanup

[[services]]
name = "app"
command = "/usr/bin/app"
pos_script = "/scripts/cleanup.sh"  # Backup, logs, etc.

4. Test Regularly

Include shutdown tests in your CI/CD pipeline.

5. Monitor Shutdown Time

# Measure shutdown time
time docker stop <container-id>

If shutdown consistently reaches the timeout, investigate whether:
  - Services are not responding to SIGTERM
  - Timeouts are too short
  - Cleanup is too slow

6. Log Shutdown Progress

Add detailed logging to your services:

log.Println("Shutdown initiated")
log.Println("Closing database connections...")
log.Println("Flushing buffers...")
log.Println("Shutdown complete")
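
One way to make these log lines more useful is to time each cleanup step, so that slow steps stand out when a shutdown approaches its timeout. The sketch below is an assumption-laden illustration: the step type, runCleanup, and errFlush are hypothetical names, and the two steps are placeholders for real cleanup work.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// step is one named cleanup action.
type step struct {
	name string
	run  func() error
}

// errFlush is a made-up failure used by the demo in main.
var errFlush = errors.New("buffer flush failed")

// runCleanup runs the steps in order, logs how long each one took, and
// returns the number of steps that failed. A failed step does not stop
// the remaining steps from running.
func runCleanup(steps []step) int {
	failed := 0
	log.Println("Shutdown initiated")
	for _, s := range steps {
		start := time.Now()
		err := s.run()
		elapsed := time.Since(start).Round(time.Millisecond)
		if err != nil {
			failed++
			log.Printf("%s failed after %s: %v", s.name, elapsed, err)
			continue
		}
		log.Printf("%s finished in %s", s.name, elapsed)
	}
	log.Println("Shutdown complete")
	return failed
}

func main() {
	failed := runCleanup([]step{
		{"Closing database connections", func() error { return nil }},
		{"Flushing buffers", func() error { return errFlush }},
	})
	log.Printf("%d cleanup step(s) failed", failed)
}
```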

Troubleshooting

Services Not Stopping Gracefully

Symptoms:
  - The timeout always expires
  - SIGKILL is always necessary

Solutions:
  - Verify that the service implements a SIGTERM handler
  - Increase service_shutdown_timeout
  - Test the service manually with kill -TERM

Container Takes Too Long to Stop

Symptoms:
  - docker stop takes too long
  - The Docker timeout is reached

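Solutions:
  - Reduce service_shutdown_timeout if appropriate
  - Optimize service cleanup
  - Use docker stop --time=<seconds> with a larger value. Docker's default of 10s is shorter than the default service_shutdown_timeout of 30s, so Docker can kill the container before Go Overlay finishes its own sequence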

Pos-Scripts Not Executing

Symptoms:
  - Cleanup scripts do not execute
  - Logs do not show pos-script execution

Solutions:
  - Check that the script is executable
  - Run the script manually
  - Verify that post_script_timeout is long enough for the script
  - Confirm that the pos_script path is correct