💀

Sidekiq: Deploys and the Async Job Pitfall

SIGTERM/SIGKILL: why long Sidekiq jobs die on every deploy

Starting point โ€” a real incident

One day a screen was found stuck in "syncing" state forever. Looking at the DB:

integ.status            # => 1  (running)
integ.success_count     # => 4808 / 6424
integ.latest_sync_at    # => yesterday

The job is nowhere in Sidekiq's Workers / Queue / Retry / Dead. It has vanished, yet the DB state still says "running": the job died mid-execution.

Tracing the cause leads to OS signals

Two ways the OS terminates a process:

SIGTERM (graceful)

"Hey, wrap up your work and shut down nicely"

  • OS just sends a signal. The process handles termination itself.

  • Sidekiq side: stop accepting new jobs → wait for the current job to finish → if it doesn't, push it back onto its queue and exit

  • Ruby side: at_exit, ensure, rescue all run

SIGKILL (forced)

"Just die now"

  • OS forcibly kills the process immediately. No say in the matter.

  • Code dies wherever it was executing

  • None of ensure, rescue, or at_exit runs

  • Open DB transactions are never committed; the server rolls them back when the connection drops

  • Cannot be caught by signal handlers

What happens during ECS / k8s deploys

1. Spin up new container
2. Healthcheck OK → send SIGTERM to old container
3. Wait for stopTimeout (e.g., 30~120 seconds)
4. Still alive โ†’ SIGKILL

This "wait for stopTimeout" period is the graceful shutdown window.

Sidekiq's behavior:
1. Stop polling
2. If there's a job running, wait :timeout (default 25s)
3. Finished within 25s → graceful shutdown ✅
4. Not finished → push the job back onto its queue and try to exit
5. Still over stopTimeout → ECS sends SIGKILL → instant death, ensure does not run ❌
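The two timers have to line up: Sidekiq's :timeout must be comfortably smaller than the container's stopTimeout, or step 5 becomes the norm. A configuration sketch (values illustrative, not from the incident):

```yaml
# config/sidekiq.yml
# How long Sidekiq waits for in-flight jobs after SIGTERM (default: 25)
:timeout: 25

# ECS task definition (fragment, JSON) -- give the container more than
# Sidekiq's :timeout before SIGKILL arrives:
#   "stopTimeout": 60
```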

How a one-hour job gets stuck

class HeavySyncService
  def run
    @integ.update(status: 1)            # mark as running on entry
    begin
      6424.times do |i|
        sleep 0.5                        # 0.5s per item
        process_one(i)                   # plus actual processing time
      end
      @integ.update(status: 77)          # mark as success on normal end
    rescue => e
      @integ.update(status: 44)          # mark as failed on exception
      raise e
    end
  end
end

6424 items × 0.5s ≈ 54 minutes. If someone merges to develop and a deploy lands while this job is running:

12:00 job starts → status=1
12:30 deploy → ECS sends SIGTERM to old container
      → Sidekiq tries to finish within :timeout (25s). Fails.
12:31 stopTimeout exceeded → ECS sends SIGKILL
      → job dies instantly. ensure doesn't run. status=1 stays.

What does the user see? Refresh the page all you want: forever "syncing (4808/6424)".

"But doesn't a dead job get re-enqueued and retried?"

Half right, half wrong.

Sidekiq's standard mechanism (under SIGTERM)

  1. Try to finish the running job within :timeout
  2. If not, push the job back to the queue (re-enqueue) and exit
  3. New container picks it up from the queue and starts from the beginning

→ Through this path, jobs don't disappear.

But under SIGKILL

  • Sidekiq itself dies instantly → no one is left to push the job back

  • OSS Sidekiq's default fetcher uses BRPOP, which deletes the job from Redis the moment it's fetched (non-reliable)

  • → When a worker is SIGKILLed, that job is simply gone

With Sidekiq Pro's super_fetch or sidekiq-reliable-fetch (OSS), jobs are kept in a separate set and recovered if a worker dies, but the default OSS Sidekiq doesn't do this.

60-second jobs vs. 1-hour jobs

Behavior on deploy, by job length:

  • 1 second: almost always finishes within the graceful window. Negligible.
  • 60 seconds: very high chance of finishing within stopTimeout. Even if SIGKILLed, it recovers naturally on the next scheduled run.
  • 5 minutes: sometimes misses the graceful window. Operationally annoying.
  • 30 min ~ 1 hour: SIGKILLed almost every time. Stuck every time. A time bomb.

Key insight: making jobs fit inside the graceful shutdown window is the most reliable defense.

How to actually defend against this

  1. Keep jobs short (most important)
  • Remove pointless sleep

  • Batch heavy per-item calls (update / broadcast) into N-item chunks

  • Split big jobs into chunks and enqueue N jobs
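A minimal sketch of the chunking idea in plain Ruby (the 6424 and 0.5 s figures come from the incident; the chunk size and job name are hypothetical): instead of one 54-minute job, enqueue one short job per fixed-size slice.

```ruby
# Split N items into (offset, count) chunks; each chunk becomes its own job.
CHUNK_SIZE = 100 # at 0.5 s/item, ~50 s per job: inside a typical stopTimeout

def chunk_ranges(total, size = CHUNK_SIZE)
  (0...total).step(size).map { |offset| [offset, [size, total - offset].min] }
end

chunks = chunk_ranges(6424)
p chunks.size  # => 65
p chunks.first # => [0, 100]
p chunks.last  # => [6400, 24]
# In the real job you would then do something like:
#   chunks.each { |offset, count| SyncChunkJob.perform_later(offset, count) }
```

Now a deploy kills at most one ~50-second chunk, and only that chunk needs re-running.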

  2. Make them idempotent
  • Re-running from scratch yields the same result

  • Persist progress per chunk so re-runs can resume
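A runnable sketch of checkpoint-based resumption, with a plain hash standing in for the DB row (field names hypothetical): a killed run leaves its checkpoint behind, and the re-run skips straight past it.

```ruby
# Resume from a persisted checkpoint instead of redoing finished items.
def sync_items(items, progress)
  start = progress[:resume_from] || 0
  items.each_with_index do |_item, i|
    next if i < start              # already done on a previous run
    # ... process the item here ...
    progress[:resume_from] = i + 1 # checkpoint after each item
  end
  progress
end

progress = { resume_from: 3 }           # pretend a previous run died after item 3
sync_items([:a, :b, :c, :d, :e], progress)
p progress[:resume_from] # => 5

sync_items([:a, :b, :c, :d, :e], progress) # re-run: skips everything
p progress[:resume_from] # => 5            # idempotent: same end state
```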

  3. Adopt reliable fetch
  • Sidekiq Pro super_fetch (paid)

  • sidekiq-reliable-fetch (OSS, watch maintenance status)

  • Prevents job loss even on SIGKILL
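For reference, Sidekiq Pro's reliable fetch is a one-line opt-in (a configuration sketch; requires the commercial gem):

```ruby
# config/initializers/sidekiq.rb -- Sidekiq Pro only
Sidekiq.configure_server do |config|
  config.super_fetch! # keeps fetched jobs recoverable until completed
end
```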

  4. Stuck detection & auto re-run job (the pragmatic best answer)
  • Idea: "If Sidekiq died mid-execution, there must be a record stuck at status=1 in the DB. So find it periodically and re-run it."

  • Example: run the following job via a scheduler (cron / sidekiq-cron / whenever) every hour

class RecoverStuckSyncsJob < ApplicationJob
  queue_as :default

  STALE_AFTER = 1.hour

  def perform
    # status enum assumed: :running / :failed map to the integer codes used earlier
    Integration.where(status: :running)
               .where('updated_at < ?', STALE_AFTER.ago)
               .find_each do |integ|
      Rails.logger.warn("[StuckRecovery] #{integ.id} stale since #{integ.updated_at}")
      integ.update!(status: :failed, alert: 'auto recovery: previous run was interrupted')
      SyncJob.perform_later(integ.id) # re-enqueue from the beginning
    end
  end
end

  • Key points:

    • If updated_at hasn't moved for a threshold period (no progress), treat the record as stuck
    • If progress exists, in-loop update calls keep updated_at fresh → no false positives
    • Flip status to failed, then re-enqueue → releases the "running" UI state
    • The job itself must be idempotent (safe to re-run from scratch)
  • Pick the interval to match your SLA. 1 hour means users can see stuck state for up to 1 hour. For sensitive screens, go to 10~15 minutes.
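With sidekiq-cron, the hourly schedule for this job could look like the following (a sketch, assuming the sidekiq-cron gem; cron syntax is standard):

```ruby
# config/initializers/sidekiq_cron.rb (sidekiq-cron gem assumed)
Sidekiq::Cron::Job.create(
  name:  'recover_stuck_syncs',
  cron:  '0 * * * *', # top of every hour
  class: 'RecoverStuckSyncsJob'
)
```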

  5. ensure insurance (auxiliary)
  • Recovers status on non-SIGKILL exit paths

  • Doesn't help with SIGKILL itself, but covers some broken termination paths
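A runnable sketch of the pattern, with a plain hash standing in for the DB row (names hypothetical): the ensure block flips a still-"running" status to failed on any exit path Ruby can unwind.

```ruby
# "ensure insurance": recover the status on every non-SIGKILL exit path.
def run_with_insurance(record)
  record[:status] = :running
  yield
  record[:status] = :success
ensure
  # Reached on exceptions, SystemExit, handled SIGTERM -- not on SIGKILL.
  record[:status] = :failed if record[:status] == :running
end

rec = {}
begin
  run_with_insurance(rec) { raise "boom mid-sync" }
rescue RuntimeError
end
p rec[:status] # => :failed   (not stuck at :running)

ok = {}
run_with_insurance(ok) { }
p ok[:status]  # => :success  (ensure leaves a finished run alone)
```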

Summary

  • Async jobs aren't "safe because they retry". Under default OSS Sidekiq, SIGKILL means the job is gone.

  • Deploy timing and job length are directly linked. The longer the job, the higher the chance of dying on every deploy.

  • A screen showing "forever running" is usually the combination of SIGKILL + ensure-not-running + the job being long in the first place.

  • The most effective defense isn't fancy infrastructure; it's making jobs short.

Key Points

1. Imagine a job with sleep 0.5 × 6424 items ≈ 54 minutes
2. Mid-execution, someone merges to develop and a deploy lands
3. ECS sends SIGTERM to the old container
4. Sidekiq tries to shut down within :timeout (25s) → fails
5. stopTimeout exceeded → ECS sends SIGKILL
6. SIGKILL skips ensure/rescue → DB keeps status=1
7. The OSS Sidekiq default fetcher is non-reliable, so the job itself does not return to the queue
8. User sees "running" forever no matter how many times they refresh

Pros

  • Job separation shortens user response time (original benefit)
  • Failure recovery can be automated via retries
  • Most stuck incidents can be prevented just by watching job length

Cons

  • Default OSS Sidekiq loses jobs on SIGKILL
  • The chance of dying on any given deploy grows with job length, until long jobs die on essentially every deploy
  • SIGKILL bypasses ensure/rescue entirely; no 100% code-level guarantee
  • Adopting reliable fetch is either paid or carries OSS maintenance risk

Use Cases

  • Large-scale data sync jobs
  • CSV import/export
  • Report aggregation jobs
  • Bulk external API call jobs
  • Jobs with misguided throttling (sleep)