Sidekiq: Deploys and the Async Job Pitfall
SIGTERM/SIGKILL: why long Sidekiq jobs die on every deploy
Starting point: a real incident
One day a screen was found stuck in "syncing" state forever. Looking at the DB:
integ.status # => 1 (running)
integ.success_count # => 4808 / 6424
integ.latest_sync_at # => yesterday
The job is nowhere in Sidekiq's Workers / Queue / Retry / Dead. The job has vanished, but the DB state still says "running", meaning the job died mid-execution.
Tracing the cause leads to OS signals
Two ways the OS terminates a process:
SIGTERM (graceful)
"Hey, wrap up your work and shut down nicely"
OS just sends a signal. The process handles termination itself.
Sidekiq side: stop accepting new jobs → wait for the current job to finish → if it doesn't finish, push it back to the retry queue and exit
Ruby side:
`at_exit`, `ensure`, and `rescue` all run
SIGKILL (forced)
"Just die now"
OS forcibly kills the process immediately. No say in the matter.
Code dies wherever it was executing
`ensure`, `rescue`, `at_exit`: none of them run
DB transactions end uncommitted
Cannot be caught by signal handlers
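The difference is easy to observe in plain Ruby, without Sidekiq at all. A minimal sketch (the child process and marker file exist only for this demo):

```ruby
require "tmpdir"
require "rbconfig"

# Spawn a child Ruby process whose ensure block writes a marker file,
# kill it with the given signal, and report whether ensure ran.
def ensure_ran_after?(signal)
  marker = File.join(Dir.tmpdir, "ensure_marker_#{signal}_#{Process.pid}")
  File.delete(marker) if File.exist?(marker)
  pid = spawn(RbConfig.ruby, "-e", <<~CHILD)
    begin
      sleep 60  # stand-in for a long-running job
    ensure
      File.write(#{marker.inspect}, "ensure ran")
    end
  CHILD
  sleep 1 # give the child time to enter the begin block
  Process.kill(signal, pid)
  Process.wait(pid)
  File.exist?(marker)
end

# SIGTERM: Ruby's default handler raises SignalException, so ensure runs.
# SIGKILL: the OS destroys the process outright; ensure never executes.
```

Running `ensure_ran_after?("TERM")` leaves the marker behind; `ensure_ran_after?("KILL")` does not.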
What happens during ECS / k8s deploys
1. Spin up new container
2. Healthcheck OK → send SIGTERM to the old container
3. Wait for stopTimeout (e.g., 30~120 seconds)
4. Still alive → SIGKILL
This "wait for stopTimeout" period is the graceful shutdown window.
Sidekiq's behavior:
1. Stop polling
2. If there's a job running, wait :timeout (default 25s)
3. Finished within 25s → graceful shutdown ✓
4. Not finished → push the job back to the retry queue and try to exit
5. Still over stopTimeout → ECS sends SIGKILL → instant death, `ensure` does not run ✗
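These two timeouts must be tuned together: Sidekiq should give up, re-enqueue the job, and exit before the orchestrator escalates to SIGKILL. A sketch (exact key names vary by Sidekiq version and orchestrator):

```yaml
# config/sidekiq.yml: how long Sidekiq waits for in-flight jobs after SIGTERM.
# (Sidekiq 7 key name; older versions used a leading colon, ":timeout:".)
timeout: 25

# The container's stopTimeout must comfortably exceed this, e.g. in an
# ECS task definition (JSON): "stopTimeout": 120
```

The invariant is simply `timeout` < `stopTimeout`, with enough margin for the re-enqueue and process exit.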
How a one-hour job gets stuck
class HeavySyncService
  def run
    @integ.update(status: 1) # mark as running on entry
    begin
      6424.times do |i|
        sleep 0.5      # 0.5s per item
        process_one(i) # plus actual processing time
      end
      @integ.update(status: 77) # mark as success on normal end
    rescue => e
      @integ.update(status: 44) # mark as failed on exception
      raise e
    end
  end
end
6424 items × 0.5s ≈ 54 minutes. If someone merges to develop and a deploy lands while this job is running:
12:00 job starts → status=1
12:30 deploy → ECS sends SIGTERM to the old container
→ Sidekiq tries to finish within :timeout (25s). Fails.
12:31 stopTimeout exceeded → ECS sends SIGKILL
→ job dies instantly. `ensure` doesn't run. status=1 remains.
What does the user see? Refresh the page all you want; it stays "syncing (4808/6424)" forever.
"But doesn't a dead job get re-enqueued and retried?"
Half right, half wrong.
Sidekiq's standard mechanism (under SIGTERM)
- Try to finish the running job within `:timeout`
- If not, push the job back to the queue (re-enqueue) and exit
- The new container picks it up from the queue and starts it from the beginning
→ Through this path, jobs don't disappear.
But under SIGKILL
Sidekiq itself dies instantly → no one is left to push the job back
OSS Sidekiq's default fetcher uses BRPOP, which deletes the job from Redis the moment it's fetched (non-reliable)
→ When a worker is SIGKILLed, that job is simply gone
With Sidekiq Pro's super_fetch or sidekiq-reliable-fetch (OSS), jobs are kept in a separate set and recovered if a worker dies, but the default OSS Sidekiq doesn't do this.
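With Sidekiq Pro, switching to the reliable fetcher is a one-line server configuration (a sketch; requires the paid `sidekiq-pro` gem):

```ruby
# config/initializers/sidekiq.rb
# super_fetch moves each fetched job into a per-process working list in
# Redis instead of deleting it, so jobs held by a SIGKILLed worker are
# recovered when a Sidekiq process starts up again.
Sidekiq.configure_server do |config|
  config.super_fetch!
end
```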
60-second jobs vs. 1-hour jobs
| Job length | Behavior on deploy |
|---|---|
| 1 second | Almost always finishes within graceful window. Negligible. |
| 60 seconds | Very high chance of finishing within stopTimeout. Even if SIGKILLed, recovers naturally on next scheduled run. |
| 5 minutes | Sometimes misses the graceful window. Operationally annoying. |
| 30 min~1 hour | SIGKILLed almost every time. Stuck every time. A time bomb. |
Key insight: making jobs fit inside the graceful shutdown window is the most reliable defense.
How to actually defend against this
- Keep jobs short (most important)
  - Remove pointless `sleep`s
  - Batch heavy per-item calls (update / broadcast) into N-item chunks
  - Split big jobs into chunks and enqueue N jobs
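The splitting step is mechanical. A sketch, assuming a hypothetical `ChunkSyncJob` and a chunk size chosen so each job finishes well inside the graceful window:

```ruby
# Split an item list into fixed-size chunks, one short job per chunk.
def chunk_ids(ids, size)
  ids.each_slice(size).to_a
end

chunks = chunk_ids((1..6424).to_a, 200)
# 33 chunks; at 0.5 s/item each job runs ~100 s instead of ~54 min,
# so a deploy at worst interrupts one small chunk, not the whole sync.

# In the enqueuing code (ChunkSyncJob is hypothetical):
#   chunks.each { |ids| ChunkSyncJob.perform_later(@integ.id, ids) }
```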
- Make them idempotent
  - Re-running from scratch yields the same result
  - Persist progress per chunk so re-runs can resume
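What "persist progress" means in code, as a minimal sketch (`processed` stands in for a DB-backed progress record; all names are illustrative):

```ruby
# Process items, skipping ones already recorded as done, and record each
# item as soon as it completes. A re-run after a crash resumes where the
# previous run stopped instead of starting over.
def sync_all(items, processed)
  items.each do |item|
    next if processed.include?(item) # idempotent: done work is skipped
    yield item
    processed << item                # persist progress before moving on
  end
  processed
end
```

A second run with the same `processed` set only touches the remaining items, which is what makes automatic re-runs safe.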
- Adopt reliable fetch
  - Sidekiq Pro `super_fetch` (paid)
  - `sidekiq-reliable-fetch` (OSS; watch its maintenance status)
  - Prevents job loss even on SIGKILL
- Stuck detection & auto re-run job (the pragmatic best answer)
  Idea: "If Sidekiq died mid-execution, there must be a record stuck at `status=1` in the DB. So find it periodically and re-run it."
  Example: run the following job via a scheduler (cron / sidekiq-cron / whenever) every hour:
class RecoverStuckSyncsJob < ApplicationJob
  queue_as :default

  STALE_AFTER = 1.hour

  def perform
    Integration.where(status: :running)
               .where('updated_at < ?', STALE_AFTER.ago)
               .find_each do |integ|
      Rails.logger.warn("[StuckRecovery] #{integ.id} stale since #{integ.updated_at}")
      integ.update!(status: :failed, alert: 'auto recovery: previous run was interrupted')
      SyncJob.perform_later(integ.id) # re-enqueue
    end
  end
end
Key points:
- If `updated_at` hasn't moved for the threshold period (no progress), treat the record as stuck
- If progress exists, the in-loop `update` calls keep `updated_at` fresh, so no false positives
- Flip `status` to failed, then re-enqueue; this releases the "running" UI state
- The job itself must be idempotent (safe to re-run from scratch)
- Pick the interval to match your SLA. One hour means users can see a stuck state for up to one hour; for sensitive screens, go to 10~15 minutes.
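If you use sidekiq-cron, the hourly schedule for the recovery job can be declared in its schedule file (a sketch; the file location and how it is loaded depend on your setup):

```yaml
# config/schedule.yml loaded by sidekiq-cron
recover_stuck_syncs:
  cron: "0 * * * *"   # every hour, on the hour
  class: "RecoverStuckSyncsJob"
```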
- `ensure` insurance (auxiliary)
  - Recovers status on non-SIGKILL exit paths
  - Doesn't help against SIGKILL itself, but covers some other broken termination paths
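A sketch of what this insurance looks like (`FakeInteg` stands in for the ActiveRecord model). Note that `rescue => e` only catches `StandardError`, so a SIGTERM-driven `SignalException` skips the rescue and reaches only the `ensure` block:

```ruby
FakeInteg = Struct.new(:status)

def run_with_insurance(integ)
  integ.status = :running
  yield
  integ.status = :success
rescue => e
  integ.status = :failed # normal exception path
  raise e
ensure
  # Exit paths that skip both branches above (e.g. SignalException from
  # SIGTERM) still run this; only SIGKILL bypasses it entirely.
  integ.status = :failed if integ.status == :running
end
```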
Summary
Async jobs aren't "safe because they retry". Under default OSS Sidekiq, SIGKILL means the job is gone.
Deploy timing and job length are directly linked. The longer the job, the higher the chance of dying on every deploy.
A screen showing "forever running" is usually the combination of SIGKILL + ensure-not-running + the job being long in the first place.
The most effective defense isn't fancy infrastructure โ it's making jobs short.
Key Points
Imagine a job with sleep 0.5 × 6424 items ≈ 54 minutes
Mid-execution, someone merges to develop and a deploy lands
ECS sends SIGTERM to the old container
Sidekiq tries to shut down within :timeout (25s) → fails
stopTimeout exceeded → ECS sends SIGKILL
SIGKILL skips ensure/rescue → DB keeps status=1
OSS Sidekiq default fetcher is non-reliable, so the job itself does not return to the queue
User sees "running" forever no matter how many times they refresh
Pros
- ✓ Job separation shortens user response time (original benefit)
- ✓ Failure recovery can be automated via retries
- ✓ Most stuck incidents can be prevented just by watching job length
Cons
- ✗ Default OSS Sidekiq loses jobs on SIGKILL
- ✗ Death probability per deploy rises non-linearly with job length
- ✗ SIGKILL bypasses ensure/rescue entirely; no 100% code-level guarantee
- ✗ Adopting reliable fetch is either paid or carries OSS maintenance risk