Graceful Shutdown¶
Graceful shutdown is a critical aspect of service management in containers. Go Overlay implements a robust shutdown process that ensures all services have a chance to shut down gracefully, save state, and release resources before forced termination.
Overview¶
When the Go Overlay receives a termination signal (typically SIGTERM in Docker environments), it initiates a coordinated shutdown sequence that:
- Stops accepting new services
- Sends termination signals to all running services
- Waits for graceful termination within configured timeouts
- Forces termination of unresponsive services
- Executes cleanup scripts (pos-scripts)
- Terminates the supervisor
Signal Handling¶
SIGTERM (Signal 15)¶
The primary signal used for graceful shutdown. When the Go Overlay receives SIGTERM:
Behavior:
- Initiates an ordered shutdown sequence
- Propagates SIGTERM to all managed services
- Respects configured timeouts
- Allows services to save state and clean up resources
Common usage:
# Docker sends SIGTERM when executing stop
docker stop <container-id>
# Kubernetes sends SIGTERM during pod termination
kubectl delete pod <pod-name>
# Manual
kill -TERM <supervisor-pid>
SIGKILL (Signal 9)¶
A forced termination signal used as a last resort. Cannot be caught or ignored.
Behavior:
- Immediate process termination
- No opportunity for cleanup
- Automatically triggered after timeout expires
- Not recommended for manual use
When used:
- Automatically by Go Overlay after service_shutdown_timeout expires
- By Docker after docker stop timeout (default 10s)
- Manually only in emergencies
Avoid SIGKILL
SIGKILL should be avoided whenever possible, as it does not allow adequate cleanup and can result in:
- Lost unsaved data
- Unclosed connections
- Unreleased locks
- Inconsistent state
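The statement that SIGKILL cannot be caught can be checked directly. The following is a minimal Python sketch, assuming a POSIX system and a call from the main thread; the helper name `can_install_handler` is made up for illustration. It tries to register a handler for a signal and reports whether the operating system accepts it.

```python
import signal


def can_install_handler(signum: int) -> bool:
    """Return True if the OS lets this process install a handler for signum."""
    def _handler(received_signum, frame):
        pass  # placeholder; the probe never relies on this being called

    try:
        previous = signal.signal(signum, _handler)
    except (OSError, ValueError):
        # The OS rejects handlers for uncatchable signals such as SIGKILL.
        return False
    # Put back the earlier disposition so the probe leaves no trace.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
    print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

Because no handler can be registered for SIGKILL, a service has no hook in which to flush data or release locks, which is why the consequences listed above follow.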
Shutdown Sequence¶
The Go Overlay follows a specific sequence during shutdown to ensure orderly termination:
flowchart TD
A[SIGTERM Received] --> B[Stop Accepting New Services]
B --> C[Identify Running Services]
C --> D[Send SIGTERM to All Services]
D --> E{All Services\nStopped?}
E -->|Yes| F[Execute Pos-Scripts]
E -->|No| G{Timeout\nExpired?}
G -->|No| E
G -->|Yes| H[Send SIGKILL to\nRemaining Services]
H --> I[Force Terminate]
I --> F
F --> J[Cleanup Resources]
J --> K[Exit Supervisor]
K --> L[Container Stops]
Detailed Steps¶
1. Signal Reception¶
2. Service Enumeration¶
Identify all services in RUNNING state
↓
Order services by reverse dependency (dependents stop first)
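One way to picture "dependents stop first" is to compute a valid start order and reverse it. The sketch below is an illustration only, not Go Overlay's implementation: the `depends_on` mapping and the `shutdown_order` function are hypothetical names, and it relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def shutdown_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return service names so that dependents come before their dependencies.

    depends_on maps each service to the services it needs (hypothetical shape).
    """
    # static_order() yields dependencies first, which is a valid start order.
    start_order = list(TopologicalSorter(depends_on).static_order())
    # Reversing the start order stops dependents before what they depend on.
    return list(reversed(start_order))


if __name__ == "__main__":
    services = {"nginx": ["app"], "app": ["database"], "database": []}
    print(shutdown_order(services))
```

For the chain in the demo, where nginx needs app and app needs database, the function returns `['nginx', 'app', 'database']`.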
3. SIGTERM Propagation¶
For each service:
- Change state to STOPPING
- Send SIGTERM to service process
- Start shutdown timer (service_shutdown_timeout)
- Log: "Stopping service: <name>"
4. Graceful Wait Period¶
While services are stopping:
- Monitor process status
- Check if processes have exited
- Respect service_shutdown_timeout
- Log progress
5. Timeout Handling¶
If service_shutdown_timeout expires:
- Log: "Service <name> did not stop gracefully, forcing termination"
- Send SIGKILL to service process
- Wait briefly for forced termination
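Steps 3 to 5 amount to "SIGTERM, wait up to a deadline, then SIGKILL". The following Python sketch shows that pattern for a single child process, assuming a POSIX system. The function name `stop_service` and its return values are illustrative and do not come from Go Overlay.

```python
import subprocess
import sys


def stop_service(proc: subprocess.Popen, shutdown_timeout: float) -> str:
    """Stop one child process: SIGTERM first, SIGKILL if the timeout expires.

    Returns "graceful" or "killed" so the caller can log the outcome.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=shutdown_timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # Hypothetical stand-in for a service: a child that sleeps for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_service(child, shutdown_timeout=5.0))
```

The final `wait()` after `kill()` corresponds to the "wait briefly for forced termination" step: SIGKILL is delivered asynchronously, so the supervisor still has to collect the exit status.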
6. Post-Script Execution¶
For each stopped service with pos_script:
- Execute pos_script
- Wait for completion (with timeout)
- Log results
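Running a cleanup script under a time limit can be sketched as follows. This is an assumption-laden illustration rather than Go Overlay's code: `run_pos_script` and its three result strings are invented names, and the command is passed as an argument list so the same function works for any executable.

```python
import subprocess
import sys


def run_pos_script(argv: list[str], timeout_seconds: float) -> str:
    """Run a cleanup command and report "ok", "failed", or "timeout"."""
    try:
        result = subprocess.run(argv, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "timeout"
    except OSError:
        # Raised when the command is missing or not executable.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Hypothetical stand-in for a pos-script: a command that exits at once.
    print(run_pos_script([sys.executable, "-c", "pass"], timeout_seconds=10.0))
```

Distinguishing "timeout" from "failed" in the logged result makes it easier to tell a slow script from a broken one.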
7. Final Cleanup¶
Timeout Configuration¶
The Go Overlay uses multiple timeouts to control the shutdown process:
service_shutdown_timeout¶
Maximum time to wait for each individual service to shut down gracefully.
Recommendations:
- Web applications: 15-30s
- Databases: 30-60s
- Background workers: 30-45s
- Stateless services: 10-15s
global_shutdown_timeout¶
Maximum total time for the entire shutdown process.
Usage:
- Ensures the supervisor does not hang indefinitely
- Should be greater than the sum of all service_shutdown_timeout values
- Should also allow time for pos-scripts to run
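The sizing rule above is simple arithmetic. The sketch below computes a conservative lower bound, assuming the worst case in which every service and every pos-script uses its full timeout one after another. Whether Go Overlay stops services sequentially or concurrently is not stated here, so treat the sum as an upper estimate of the time needed. The function name and the default buffer are illustrative.

```python
def minimum_global_timeout(service_timeouts, post_script_timeouts, buffer_seconds=10.0):
    """Return a conservative global timeout in seconds.

    Worst case assumed: each service and each pos-script consumes its whole
    timeout in turn, plus a safety buffer (the default value is arbitrary).
    """
    return sum(service_timeouts) + sum(post_script_timeouts) + buffer_seconds


if __name__ == "__main__":
    # Three services at 30s each and one pos-script at 10s.
    print(minimum_global_timeout([30, 30, 30], [10]))
```

With three services at 30s, one pos-script at 10s, and a 10s buffer, the sketch yields 110 seconds, so a global_shutdown_timeout of 120s would leave some headroom.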
post_script_timeout¶
Maximum time for pos-scripts execution.
Implementing Graceful Shutdown in Services¶
To benefit from graceful shutdown, your services should implement appropriate signal handlers.
Go Example¶
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	server := &http.Server{Addr: ":8080"}

	// Setup signal handling
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)

	// Start server in goroutine
	go func() {
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()
	log.Println("Server started on :8080")

	// Wait for signal
	<-sigChan
	log.Println("Shutdown signal received, stopping gracefully...")

	// Create shutdown context with timeout
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Attempt graceful shutdown
	if err := server.Shutdown(ctx); err != nil {
		log.Printf("Error during shutdown: %v", err)
	}
	log.Println("Server stopped")
}
Python Example¶
import signal
import sys
import time


class GracefulShutdown:
    def __init__(self):
        self.shutdown_requested = False
        signal.signal(signal.SIGTERM, self.handle_signal)
        signal.signal(signal.SIGINT, self.handle_signal)

    def handle_signal(self, signum, frame):
        print(f"Received signal {signum}, initiating graceful shutdown...")
        self.shutdown_requested = True

    def cleanup(self):
        print("Performing cleanup...")
        # Close database connections
        # Flush buffers
        # Save state
        time.sleep(2)  # Simulate cleanup
        print("Cleanup complete")


def main():
    shutdown_handler = GracefulShutdown()
    print("Service started")

    # Main service loop
    while not shutdown_handler.shutdown_requested:
        # Do work
        time.sleep(1)

    # Cleanup before exit
    shutdown_handler.cleanup()
    print("Service stopped")
    sys.exit(0)


if __name__ == "__main__":
    main()
Node.js Example¶
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200);
  res.end('Hello World\n');
});

server.listen(3000, () => {
  console.log('Server started on port 3000');
});

// Graceful shutdown handler
function gracefulShutdown(signal) {
  console.log(`Received ${signal}, starting graceful shutdown...`);

  server.close(() => {
    console.log('HTTP server closed');
    // Close database connections
    // Flush logs
    // Save state
    console.log('Cleanup complete, exiting');
    process.exit(0);
  });

  // Force exit after timeout
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
}

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
Testing Graceful Shutdown¶
It is essential to test the graceful shutdown behavior of your services before putting them into production.
Local Testing¶
Test 1: Normal Graceful Shutdown¶
# Start container
docker run -d --name test-supervisor your-image
# Wait for services to start
sleep 5
# Send SIGTERM (graceful)
docker stop test-supervisor
# Check logs
docker logs test-supervisor
Expected output:
Received shutdown signal, initiating graceful shutdown...
Stopping service: nginx
Stopping service: app
Service nginx stopped gracefully
Service app stopped gracefully
All services stopped, exiting
Test 2: Timeout Behavior¶
Configure a short timeout and a service that does not respond:
[timeouts]
service_shutdown_timeout = "5s"
[[services]]
name = "slow-service"
command = "/usr/bin/slow-app" # App that ignores SIGTERM
Expected output:
Stopping service: slow-service
Service slow-service did not stop gracefully, forcing termination
Service slow-service terminated with SIGKILL
Test 3: Pos-Script Execution¶
Expected output:
Stopping service: database
Service database stopped gracefully
Executing pos-script for database: /scripts/backup-db.sh
Pos-script completed successfully
Automated Testing¶
Create automated tests to validate shutdown behavior:
#!/bin/bash
# test-shutdown.sh
set -e
echo "Testing graceful shutdown..."
# Start container
CONTAINER_ID=$(docker run -d your-image)
# Wait for startup
sleep 5
# Send stop signal
docker stop --time=60 "$CONTAINER_ID"
# Check exit code
EXIT_CODE=$(docker inspect "$CONTAINER_ID" --format='{{.State.ExitCode}}')
if [ "$EXIT_CODE" -eq 0 ]; then
    echo "✓ Graceful shutdown successful"
else
    echo "✗ Shutdown failed with exit code $EXIT_CODE"
    exit 1
fi
# Cleanup
docker rm "$CONTAINER_ID"
Best Practices¶
1. Set Appropriate Timeouts¶
[timeouts]
# Allow enough time for cleanup
service_shutdown_timeout = "30s"
# Global timeout should be sum of all service timeouts + buffer
global_shutdown_timeout = "120s"
# Pos-scripts should be quick
post_script_timeout = "10s"
2. Implement Signal Handlers¶
Always implement SIGTERM handlers in your services, following the Go, Python, and Node.js examples above.
3. Use Pos-Scripts for Cleanup¶
[[services]]
name = "app"
command = "/usr/bin/app"
pos_script = "/scripts/cleanup.sh" # Backup, logs, etc.
4. Test Regularly¶
Include shutdown tests in your CI/CD pipeline.
5. Monitor Shutdown Time¶
If shutdown consistently reaches the timeout, investigate whether:
- Services are not responding to SIGTERM
- Timeouts are too short
- Cleanup is too slow
6. Log Shutdown Progress¶
Add detailed logging to your services:
log.Println("Shutdown initiated")
log.Println("Closing database connections...")
log.Println("Flushing buffers...")
log.Println("Shutdown complete")
Troubleshooting¶
Services Not Stopping Gracefully¶
Symptoms:
- Timeout always expires
- SIGKILL is always necessary
Solutions:
- Verify that the service implements a SIGTERM handler
- Increase service_shutdown_timeout
- Test the service manually with kill -TERM
Container Takes Too Long to Stop¶
Symptoms:
- docker stop takes too long
- Docker timeout is reached
Solutions:
- Reduce service_shutdown_timeout if appropriate
- Optimize service cleanup
- Use docker stop --time=<seconds> with a larger value
Pos-Scripts Not Executing¶
Symptoms:
- Cleanup scripts do not execute
- Logs do not show pos-script execution
Solutions:
- Check script permissions
- Test script manually
- Verify post_script_timeout
- Confirm path is correct
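The path and permission checks can be automated before a container is deployed. The following Python sketch is one possible pre-flight check, assuming a POSIX filesystem. `check_script` is a hypothetical helper, not part of Go Overlay.

```python
import os
import tempfile


def check_script(path: str) -> list[str]:
    """Return a list of problems that would stop a script from running."""
    problems = []
    if not os.path.isfile(path):
        problems.append("path does not exist or is not a regular file")
    elif not os.access(path, os.X_OK):
        problems.append("file is not executable (try chmod +x)")
    return problems


if __name__ == "__main__":
    # Create a throwaway script to demonstrate both outcomes.
    with tempfile.NamedTemporaryFile(suffix=".sh", delete=False) as handle:
        handle.write(b"#!/bin/sh\nexit 0\n")
    os.chmod(handle.name, 0o644)
    print(check_script(handle.name))  # reports the missing execute bit
    os.chmod(handle.name, 0o755)
    print(check_script(handle.name))  # no problems once it is executable
    os.remove(handle.name)
```

An empty list means the two most common causes, a wrong path and a missing execute bit, are ruled out, leaving post_script_timeout and the script's own logic to examine.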
Related Topics¶
- Service Lifecycle - States and transitions of services
- Timeouts Configuration - Detailed timeout configuration
- Health Monitoring - Health monitoring