💀

Sidekiq: Deploys and the Async Job Pitfall

SIGTERM/SIGKILL: why long Sidekiq jobs die on every deploy

Starting point โ€” a real incident

One day a screen was found stuck in "syncing" state forever. Looking at the DB:

integ.status            # => 1  (running)
integ.success_count     # => 4808 / 6424
integ.latest_sync_at    # => yesterday

The job is nowhere in Sidekiq's Workers / Queue / Retry / Dead. It has vanished, yet the DB state still says "running": the job died mid-execution.

Tracing the cause leads to OS signals

Two ways the OS terminates a process:

SIGTERM (graceful)

"Hey, wrap up your work and shut down nicely"

  • OS just sends a signal. The process handles termination itself.

  • Sidekiq side: stop accepting new jobs → wait for the current job to finish → if it doesn't, push it back onto its queue and exit

  • Ruby side: at_exit, ensure, rescue all run

SIGKILL (forced)

"Just die now"

  • OS forcibly kills the process immediately. No say in the matter.

  • Code dies wherever it was executing

  • None of ensure, rescue, or at_exit runs

  • Open DB transactions are never committed; the server rolls them back when the connection drops

  • Cannot be caught by signal handlers

What happens during ECS / k8s deploys

1. Spin up new container
2. Healthcheck OK → send SIGTERM to old container
3. Wait for stopTimeout (e.g., 30~120 seconds)
4. Still alive โ†’ SIGKILL

This "wait for stopTimeout" period is the graceful shutdown window.

Sidekiq's behavior:
1. Stop polling
2. If there's a job running, wait :timeout (default 25s)
3. Finished within 25s → graceful shutdown ✅
4. Not finished → push the job back onto its queue and try to exit
5. Still over stopTimeout → ECS sends SIGKILL → instant death, ensure does not run ❌
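The two timers have to line up: Sidekiq's :timeout must be comfortably smaller than the container's stopTimeout, or step 5 becomes the norm. A configuration sketch (values illustrative, not from the incident):

```yaml
# config/sidekiq.yml
# How long Sidekiq waits for in-flight jobs after SIGTERM (default: 25)
:timeout: 25

# ECS task definition (fragment, JSON) -- give the container more than
# Sidekiq's :timeout before SIGKILL arrives:
#   "stopTimeout": 60
```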

How a one-hour job gets stuck

class HeavySyncService
  def run
    @integ.update(status: 1)            # mark as running on entry
    begin
      6424.times do |i|
        sleep 0.5                        # 0.5s per item
        process_one(i)                   # plus actual processing time
      end
      @integ.update(status: 77)          # mark as success on normal end
    rescue => e
      @integ.update(status: 44)          # mark as failed on exception
      raise e
    end
  end
end

6424 items × 0.5s ≈ 54 minutes. If someone merges to develop and a deploy lands while this job is running:

12:00 job starts → status=1
12:30 deploy → ECS sends SIGTERM to old container
      → Sidekiq tries to finish within :timeout (25s). Fails.
12:31 stopTimeout exceeded → ECS sends SIGKILL
      → job dies instantly. ensure doesn't run. status=1 stays.

What does the user see? Refresh the page all you want: forever "syncing (4808/6424)".

"But doesn't a dead job get re-enqueued and retried?"

Half right, half wrong.

Sidekiq's standard mechanism (under SIGTERM)

  1. Try to finish the running job within :timeout
  2. If not, push the job back to the queue (re-enqueue) and exit
  3. New container picks it up from the queue and starts from the beginning

→ Through this path, jobs don't disappear.

But under SIGKILL

  • Sidekiq itself dies instantly → no one is left to push the job back

  • OSS Sidekiq's default fetcher uses BRPOP, which deletes the job from Redis the moment it's fetched (non-reliable)

  • → When a worker is SIGKILLed, that job is simply gone

With Sidekiq Pro's super_fetch or sidekiq-reliable-fetch (OSS), jobs are kept in a separate set and recovered if a worker dies, but the default OSS Sidekiq doesn't do this.

60-second jobs vs. 1-hour jobs

Behavior on deploy, by job length:

  • 1 second: almost always finishes within the graceful window. Negligible.
  • 60 seconds: very high chance of finishing within stopTimeout. Even if SIGKILLed, it recovers naturally on the next scheduled run.
  • 5 minutes: sometimes misses the graceful window. Operationally annoying.
  • 30 min ~ 1 hour: SIGKILLed almost every time. Stuck every time. A time bomb.

Key insight: making jobs fit inside the graceful shutdown window is the most reliable defense.

How to actually defend against this

  1. Keep jobs short (most important)
  • Remove pointless sleep

  • Batch heavy per-item calls (update / broadcast) into N-item chunks

  • Split big jobs into chunks and enqueue N jobs
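A minimal sketch of the chunking idea in plain Ruby (the 6424 and 0.5 s figures come from the incident; the chunk size and job name are hypothetical): instead of one 54-minute job, enqueue one short job per fixed-size slice.

```ruby
# Split N items into (offset, count) chunks; each chunk becomes its own job.
CHUNK_SIZE = 100 # at 0.5 s/item, ~50 s per job: inside a typical stopTimeout

def chunk_ranges(total, size = CHUNK_SIZE)
  (0...total).step(size).map { |offset| [offset, [size, total - offset].min] }
end

chunks = chunk_ranges(6424)
p chunks.size  # => 65
p chunks.first # => [0, 100]
p chunks.last  # => [6400, 24]
# In the real job you would then do something like:
#   chunks.each { |offset, count| SyncChunkJob.perform_later(offset, count) }
```

Now a deploy kills at most one ~50-second chunk, and only that chunk needs re-running.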

  2. Make them idempotent
  • Re-running from scratch yields the same result

  • Persist progress per chunk so re-runs can resume
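A runnable sketch of checkpoint-based resumption, with a plain hash standing in for the DB row (field names hypothetical): a killed run leaves its checkpoint behind, and the re-run skips straight past it.

```ruby
# Resume from a persisted checkpoint instead of redoing finished items.
def sync_items(items, progress)
  start = progress[:resume_from] || 0
  items.each_with_index do |_item, i|
    next if i < start              # already done on a previous run
    # ... process the item here ...
    progress[:resume_from] = i + 1 # checkpoint after each item
  end
  progress
end

progress = { resume_from: 3 }           # pretend a previous run died after item 3
sync_items([:a, :b, :c, :d, :e], progress)
p progress[:resume_from] # => 5

sync_items([:a, :b, :c, :d, :e], progress) # re-run: skips everything
p progress[:resume_from] # => 5            # idempotent: same end state
```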

  3. Adopt reliable fetch
  • Sidekiq Pro super_fetch (paid)

  • sidekiq-reliable-fetch (OSS, watch maintenance status)

  • Prevents job loss even on SIGKILL
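For reference, Sidekiq Pro's reliable fetch is a one-line opt-in (a configuration sketch; requires the commercial gem):

```ruby
# config/initializers/sidekiq.rb -- Sidekiq Pro only
Sidekiq.configure_server do |config|
  config.super_fetch! # keeps fetched jobs recoverable until completed
end
```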

  4. Stuck detection & auto re-run job (the pragmatic best answer)
  • Idea: "If Sidekiq died mid-execution, there must be a record stuck at status=1 in the DB. So find it periodically and re-run it."

  • Example: run the following job via a scheduler (cron / sidekiq-cron / whenever) every hour

class RecoverStuckSyncsJob < ApplicationJob
  queue_as :default

  STALE_AFTER = 1.hour

  def perform
    # status enum assumed: :running / :failed map to the integer codes used earlier
    Integration.where(status: :running)
               .where('updated_at < ?', STALE_AFTER.ago)
               .find_each do |integ|
      Rails.logger.warn("[StuckRecovery] #{integ.id} stale since #{integ.updated_at}")
      integ.update!(status: :failed, alert: 'auto recovery: previous run was interrupted')
      SyncJob.perform_later(integ.id) # re-enqueue from the beginning
    end
  end
end

  • Key points:

    • If updated_at hasn't moved for a threshold period (no progress), treat the record as stuck
    • If progress exists, in-loop update calls keep updated_at fresh → no false positives
    • Flip status to failed, then re-enqueue → releases the "running" UI state
    • The job itself must be idempotent (safe to re-run from scratch)
  • Pick the interval to match your SLA. 1 hour means users can see stuck state for up to 1 hour. For sensitive screens, go to 10~15 minutes.
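With sidekiq-cron, the hourly schedule for this job could look like the following (a sketch, assuming the sidekiq-cron gem; cron syntax is standard):

```ruby
# config/initializers/sidekiq_cron.rb (sidekiq-cron gem assumed)
Sidekiq::Cron::Job.create(
  name:  'recover_stuck_syncs',
  cron:  '0 * * * *', # top of every hour
  class: 'RecoverStuckSyncsJob'
)
```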

  5. ensure insurance (auxiliary)
  • Recovers status on non-SIGKILL exit paths

  • Doesn't help with SIGKILL itself, but covers some broken termination paths
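A runnable sketch of the pattern, with a plain hash standing in for the DB row (names hypothetical): the ensure block flips a still-"running" status to failed on any exit path Ruby can unwind.

```ruby
# "ensure insurance": recover the status on every non-SIGKILL exit path.
def run_with_insurance(record)
  record[:status] = :running
  yield
  record[:status] = :success
ensure
  # Reached on exceptions, SystemExit, handled SIGTERM -- not on SIGKILL.
  record[:status] = :failed if record[:status] == :running
end

rec = {}
begin
  run_with_insurance(rec) { raise "boom mid-sync" }
rescue RuntimeError
end
p rec[:status] # => :failed   (not stuck at :running)

ok = {}
run_with_insurance(ok) { }
p ok[:status]  # => :success  (ensure leaves a finished run alone)
```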

Summary

  • Async jobs aren't "safe because they retry". Under default OSS Sidekiq, SIGKILL means the job is gone.

  • Deploy timing and job length are directly linked. The longer the job, the higher the chance of dying on every deploy.

  • A screen showing "forever running" is usually the combination of SIGKILL + ensure-not-running + the job being long in the first place.

  • The most effective defense isn't fancy infrastructure; it's making jobs short.

Key Points

1. Imagine a job with sleep 0.5 × 6424 items ≈ 54 minutes
2. Mid-execution, someone merges to develop and a deploy lands
3. ECS sends SIGTERM to the old container
4. Sidekiq tries to shut down within :timeout (25s) → fails
5. stopTimeout exceeded → ECS sends SIGKILL
6. SIGKILL skips ensure/rescue → DB keeps status=1
7. The OSS Sidekiq default fetcher is non-reliable, so the job itself does not return to the queue
8. User sees "running" forever no matter how many times they refresh

Pros

  • Job separation shortens user response time (original benefit)
  • Failure recovery can be automated via retries
  • Most stuck incidents can be prevented just by watching job length

Cons

  • Default OSS Sidekiq loses jobs on SIGKILL
  • The chance of dying on any given deploy grows with job length, until long jobs die on essentially every deploy
  • SIGKILL bypasses ensure/rescue entirely; no 100% code-level guarantee
  • Adopting reliable fetch is either paid or carries OSS maintenance risk

Use Cases

  • Large-scale data sync jobs
  • CSV import/export
  • Report aggregation jobs
  • Bulk external API call jobs
  • Jobs with misguided throttling (sleep)