Fault Tolerance — Service Failure Scenarios

Scope: All OneWallet services
Last updated: 2026-05-01

Overview

OneWallet is a regulated Thai e-money platform — data integrity and financial consistency are paramount. This document describes system behavior under failure conditions and recovery procedures.

Design Principles

  • TigerBeetle is the source of truth for money. pm.intent and pm.tx_history are secondary. In any conflict, TB wins.
  • Redis failures are non-fatal for most flows. PM can degrade gracefully (no real-time pub/sub, no notifications), but payments can still be initiated and reconciled.
  • All payment state transitions are logged to pm.intent_event — immutable audit trail regardless of which component failed.
  • Idempotency everywhere. Intent IDs are UUIDs supplied by the caller; TB transfer IDs are derived deterministically via uuidv5TbTransfer(intentId, index). Neither can be created twice.
  • Startup reconciler runs before traffic. On every PM restart, reconcile(db) runs before app.listen() — it classifies and heals stale intents automatically.
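
The deterministic ID derivation in the idempotency principle can be illustrated with a small sketch. The source names only the function uuidv5TbTransfer(intentId, index); the namespace UUID and the `intentId:index` name format below are assumptions made for illustration, not the real implementation. The point is that the same inputs always yield the same transfer ID, so a retried call cannot mint a second transfer.

```typescript
import { createHash } from "node:crypto";

// ASSUMPTION: placeholder namespace (the RFC 4122 DNS namespace). The real
// namespace used by PM is not stated in this document.
const TB_TRANSFER_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

// Standard UUID version 5: SHA-1 over the namespace bytes followed by the name.
function uuidv5(name: string, namespace: string): string {
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  const digest = createHash("sha1")
    .update(namespaceBytes)
    .update(name, "utf8")
    .digest();
  const bytes = Buffer.from(digest.subarray(0, 16));
  bytes.writeUInt8((bytes.readUInt8(6) & 0x0f) | 0x50, 6); // version 5
  bytes.writeUInt8((bytes.readUInt8(8) & 0x3f) | 0x80, 8); // RFC 4122 variant
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// ASSUMPTION: the name is built as "<intentId>:<index>"; only the function
// name and its two parameters come from the document.
function uuidv5TbTransfer(intentId: string, index: number): string {
  return uuidv5(`${intentId}:${index}`, TB_TRANSFER_NAMESPACE);
}
```

Because the ID is a pure function of (intentId, index), a worker that crashes and retries submits the same ID again, and the ledger can reject the duplicate instead of moving money twice.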

Scenario 1: Payment Manager (PM) is Down

Impact:

  • Flutter users cannot create new payments: the POST /intents path is unreachable and nginx returns 503.
  • Balance queries via GET /api/pm/accounts/balance fail: the nginx auth_request to PM returns 503, so the full page request is blocked.
  • Transaction history via GET /api/pm/accounts/transactions is unavailable.
  • INTERNAL_P2P transfers fail cleanly: they are synchronous, so no intent is ever created.
  • IPPS_TRANSFER intents that reached AUTHORIZED (TB pending already created) remain in that state until PM restarts; funds are held in the system.transit.IPPS.THB pending transfer.
  • pm.outbox_event rows with status='pending' accumulate because OutboxWorker is not running.

Detection:

  • nginx returns 503 to Flutter clients.
  • GET /health (internal) returns no response.
  • Grafana alert: PM process down (Fluent Bit loses the log stream from the PM container).

Recovery:

  1. Restart the PM container.
  2. On startup, reconcile(db) runs automatically before PM serves traffic:
      • Intents in CREATED/VALIDATED with no TB transfers → set to FAILED (no TB call needed, because nothing was sent).
      • Intents in AUTHORIZED with tbPendingTimeout > 0 (internal channels like INTERNAL_P2P) → void pending → set to FAILED.
      • Intents in AUTHORIZED with tbPendingTimeout = 0 (external channels like IPPS_TRANSFER) → warning log only; PspWorker resumes processing.
  3. OutboxWorker restarts and processes any outbox_event rows with status='pending'.
  4. PspWorker restarts and picks up any psp_tx_map rows still in actionable states (NEW, QUERIED, QUERY_PENDING with expired lease).
  5. No data loss: all state is durable in PostgreSQL and TigerBeetle.
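
The three reconciler rules in step 2 amount to a small decision table. The sketch below restates them as a pure function; the type names, field names, and the classifyStaleIntent helper are hypothetical and are not taken from the PM codebase. Cases the rules do not cover are left untouched.

```typescript
// Hypothetical types for illustration only.
type IntentStatus = "CREATED" | "VALIDATED" | "AUTHORIZED" | "SETTLED" | "FAILED";

interface StaleIntent {
  status: IntentStatus;
  tbTransferCount: number; // TB transfers already created for this intent
  tbPendingTimeout: number; // > 0 for internal channels, 0 for external ones
}

type ReconcileAction =
  | "FAIL" // mark FAILED, no TB call needed
  | "VOID_PENDING_THEN_FAIL" // void the TB pending transfer, then mark FAILED
  | "WARN_ONLY" // log a warning and leave the intent to PspWorker
  | "NONE"; // not covered by the rules above; leave as-is

function classifyStaleIntent(intent: StaleIntent): ReconcileAction {
  switch (intent.status) {
    case "CREATED":
    case "VALIDATED":
      // Nothing was sent to TB, so failing the intent is safe.
      return intent.tbTransferCount === 0 ? "FAIL" : "NONE";
    case "AUTHORIZED":
      // Internal channels carry a pending timeout; external ones do not.
      return intent.tbPendingTimeout > 0 ? "VOID_PENDING_THEN_FAIL" : "WARN_ONLY";
    default:
      // Terminal states need no healing.
      return "NONE";
  }
}
```

Keeping the classification free of I/O makes it easy to unit-test separately from the TB and PostgreSQL calls that carry out each action.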

What NOT to do:

  • Do NOT manually edit pm.intent.status rows; use the force-resolve API endpoint (Plan 3).
  • Do NOT restart TigerBeetle or Redis while PM is recovering.
  • Do NOT run two OutboxWorker instances simultaneously. SELECT FOR UPDATE SKIP LOCKED prevents double-processing at the DB level, but the outbox is designed as a single-instance worker (tech-debt-outbox-multi-instance in backlog).


Scenario 2: TigerBeetle is Down

Impact:

  • All payment creation fails: PM cannot call createTransfers() at the AUTHORIZED step.
  • Balance queries via GET /accounts/.../balance fail (PM calls TB lookupAccounts).
  • Intents in VALIDATED state cannot proceed to AUTHORIZED.
  • Intents in AUTHORIZED state: OutboxWorker attempts POST_PENDING and gets a TB error → retryCount++, and the event stays pending. There is no data corruption, but settlement is blocked.
  • PspWorker: if an IPPS outcome arrives (completed or failed), applyOutcome tries to create a void_pending or post_pending outbox event. The DB write succeeds, but OutboxWorker's subsequent TB call fails.

Detection:

  • PM returns HTTP 500 with the TB_TRANSFER_ERROR code on all /intents POST requests.
  • GET /accounts/.../balance returns 500.
  • Grafana alert: TB process down or PM error-rate spike.

Recovery:

  1. Restore TigerBeetle from backup or restart it.
  2. PM reconnects automatically: the TB Node.js SDK has built-in reconnection.
  3. TB is crash-safe (write-ahead log): there are no partial writes, and pending transfers created before the crash are intact.
  4. OutboxWorker resumes processing outbox_event rows automatically.
  5. After recovery, verify the transit.balance = 0 invariant: system.transit.*.THB accounts must all have credits_posted - debits_posted == 0 (or credits_pending - debits_pending == 0 for in-flight holds). Run the startup reconciler if needed.
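
The invariant in step 5 can be checked mechanically once the transit accounts have been fetched. The sketch below is a pure function over already-fetched balances; the TransitAccount shape mirrors the balance fields named above, while checkTransitAccounts and the result fields are hypothetical. As an assumption of this sketch, a non-zero pending net is reported for review rather than treated as a failure, since holds may legitimately be in flight.

```typescript
// Balance fields as named in the invariant; amounts are integers in minor units.
interface TransitAccount {
  name: string;
  credits_posted: bigint;
  debits_posted: bigint;
  credits_pending: bigint;
  debits_pending: bigint;
}

// Hypothetical result shape for this sketch.
interface TransitCheck {
  name: string;
  postedNet: bigint; // credits_posted - debits_posted, expected to be 0
  pendingNet: bigint; // credits_pending - debits_pending, reported for review
  postedBalanced: boolean;
}

function checkTransitAccounts(accounts: TransitAccount[]): TransitCheck[] {
  return accounts.map((account) => {
    const postedNet = account.credits_posted - account.debits_posted;
    const pendingNet = account.credits_pending - account.debits_pending;
    return { name: account.name, postedNet, pendingNet, postedBalanced: postedNet === 0n };
  });
}

// Names of accounts whose posted balance violates the invariant.
function transitViolations(accounts: TransitAccount[]): string[] {
  return checkTransitAccounts(accounts)
    .filter((check) => !check.postedBalanced)
    .map((check) => check.name);
}
```

An empty result from transitViolations means every transit account nets to zero on posted amounts; any name it returns is a starting point for investigation, not proof of loss.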

Financial integrity guarantee:

  • TigerBeetle's LINKED flag means transfer batches are atomic: a crash mid-batch leaves no partial state.
  • CREDITS_MUST_NOT_EXCEED_DEBITS on user accounts means no overdraft can be committed even after restart.


Scenario 3: Redis / Valkey is Down

Impact:

  • Intent status pub/sub is broken: PM's OutboxWorker calls PUBLISH intent.{id}, the call fails silently, and Serverpod does not receive the real-time push. Redis pub/sub is fire-and-forget and PM does not retry or queue these messages, so status updates reach Flutter only through the fallback of polling getIntent(id) via Serverpod → PM HTTP.
  • Push notifications are delayed or lost, depending on when they were published. Messages already in the Redis Stream stream.notifications.jobs before the outage are delayed, not lost: when Redis recovers (with persistence enabled), XREADGROUP in the Notifications Service resumes from the last acknowledged position. Messages that producers try to publish while Redis is down are never delivered, because XADD to a dead Redis fails.
  • Limits cache miss: PM's rate/limits cache (if stored in Redis) misses, and PM falls back to DB-level limit enforcement. No payments are incorrectly allowed; throughput may degrade.
  • BullMQ (KYC Service): jobs cannot be added or processed, so KYC submissions fail. Already-queued jobs persist in Redis (if persistence is enabled) and are processed on recovery.
  • Phase 2B PSP adapter streams: stream.ipps.jobs / stream.ipps.results are blocked, so IPPS/QP processing cannot start new jobs. In-flight jobs (leased in pm.psp_tx_map) are not affected because their state is in PostgreSQL.

Detection:

  • Redis connection errors in PM logs (consumer.autoclaim_error).
  • Notifications Service logs: consumer.group.created failure or connection refused.
  • Grafana alert: Redis process down.

Recovery:

  1. Restore Redis with persistence enabled (RDB or AOF; see the CRITICAL note below).
  2. Stream contents survive the restart only if Redis persistence was enabled; in that case, messages published before the failure are still in the stream. After restart, consumers resume reading undelivered entries via XREADGROUP with the '>' ID.
  3. XAUTOCLAIM (idle threshold: 30s in the Notifications Service; do not lower it, to avoid double-delivery) handles messages that were in flight (delivered but not ACK'd) during the downtime.
  4. BullMQ jobs: if Redis lost data (no persistence), in-flight KYC jobs may be lost → users must resubmit KYC.
  5. PM reconnects to Redis automatically on the next publish attempt.
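
The warning about the XAUTOCLAIM idle threshold in step 3 is easier to see with numbers. The sketch below models only the selection rule (claim a pending entry once it has been idle for the threshold or longer) as a pure function. It does not talk to Redis, and the PendingEntry shape and selectClaimable helper are hypothetical.

```typescript
// Hypothetical view of one pending (delivered but not yet ACK'd) stream entry.
interface PendingEntry {
  id: string;
  consumer: string;
  idleMs: number; // time since the entry was last delivered
}

const AUTOCLAIM_MIN_IDLE_MS = 30_000; // 30s threshold from the Notifications Service

// Entries that an autoclaim pass with the given threshold would take over.
function selectClaimable(
  pending: PendingEntry[],
  minIdleMs: number = AUTOCLAIM_MIN_IDLE_MS,
): PendingEntry[] {
  return pending.filter((entry) => entry.idleMs >= minIdleMs);
}

// Illustrative data: one entry orphaned by a crashed consumer, and one that a
// slow but healthy consumer is still working on.
const examplePending: PendingEntry[] = [
  { id: "1714550000000-0", consumer: "worker-a", idleMs: 45_000 },
  { id: "1714550030000-0", consumer: "worker-b", idleMs: 12_000 },
];

const claimedAtDefault = selectClaimable(examplePending); // only the orphaned entry
const claimedAtTenSeconds = selectClaimable(examplePending, 10_000); // both entries
```

With a 10s threshold the second entry would be claimed while worker-b is still processing it, so the same notification could be sent twice. That is the double-delivery risk the 30s value is meant to avoid.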

CRITICAL: Redis persistence (RDB snapshots or AOF) must be enabled in production. A Redis restart with no persistence loses all stream data, all pending notification jobs, and all rate-limit state.


Scenario 4: PostgreSQL is Down

Impact:

  • All services are degraded: auth fails (Serverpod cannot validate JWT via session lookup), KYC fails, and notifications cannot look up device tokens.
  • PM cannot create intents (cannot INSERT pm.intent), cannot read pm.service_key for HMAC verification, and cannot write pm.intent_event.
  • TigerBeetle continues operating independently, and ledger state is intact. PM cannot record intent state or write pm.tx_history, but TB transfers that were already committed remain committed.
  • Admin Panel is unusable (all API calls fail).

Detection:

  • All services return 500/503.
  • PostgreSQL connection pool exhaustion errors in all service logs.
  • Grafana alert: PostgreSQL process down.

Recovery:

  1. Restore PostgreSQL from backup (point-in-time recovery if WAL archiving is configured).
  2. All services reconnect automatically via connection pool retry.
  3. After the restore, run the PM startup reconciler to compare pm.intent state against TB.
  4. Reconciliation gap: the pm.intent_event audit trail since the last backup may be incomplete. TB is authoritative for money; use TB lookupAccounts and queryTransfers to reconstruct the true financial state if needed.
  5. Check for pm.outbox_event rows that were pending at the time of failure. OutboxWorker reprocesses them on PM restart.
  6. Check pm.psp_tx_map for rows in CONFIRMED state that have no post_pending outbox event. A crash alone should not produce this state: applyOutcome writes CONFIRMED and INSERTs the outbox_event in the same transaction, so a crash between the two rolls back both, and the row stays in its previous state and is re-picked. After a restore from backup, however, treat any such row as an inconsistency to investigate.


Scenario 5: IPPS is Down or Unreachable

Impact:

  • New IPPS_TRANSFER intents: PspWorker calls driver.process() → IppsDriver calls the query endpoint → HTTP timeout or 5xx → error classified as 'retry' (transport errors on query) → in-progress(QUERY_PENDING, retryIncrement=true) in pm.psp_tx_map.
  • Funds are already held at this point: for IPPS_TRANSFER, TB PENDING is created at the AUTHORIZED step, before PspWorker picks up the row. While IPPS is failing at the query step, the psp_tx_map row cycles through QUERY_PENDING retries, the TB PENDING hold stays in place, and the funds remain frozen.
  • After PSP_MAX_RETRIES retries (default 3) with a transport/retry-class error, IppsDriver escalates to manual review: pm.psp_tx_map.state = 'MANUAL_REVIEW', and pm.intent.status stays AUTHORIZED with TB PENDING still active.
  • No new IPPS transfers can complete until IPPS recovers.
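
The retry-then-escalate rule described above can be sketched as a pure decision function. The function name, the result shape, and the exact boundary (escalating once the stored retry count has reached PSP_MAX_RETRIES) are assumptions for illustration; the document states only the default of 3 and the two resulting states.

```typescript
const PSP_MAX_RETRIES = 3; // default from this document

// Hypothetical result shape: the next psp_tx_map state and whether the retry
// counter should be incremented.
interface RetryDecision {
  state: "QUERY_PENDING" | "MANUAL_REVIEW";
  retryIncrement: boolean;
}

// Decide what happens to a row after a transport/retry-class error on query.
// ASSUMPTION: retryCount is the number of retries already recorded on the row.
function onQueryTransportError(
  retryCount: number,
  maxRetries: number = PSP_MAX_RETRIES,
): RetryDecision {
  if (retryCount >= maxRetries) {
    // Retry budget exhausted: stop calling IPPS and hand the row to ops.
    return { state: "MANUAL_REVIEW", retryIncrement: false };
  }
  // Budget remaining: keep the row in QUERY_PENDING and count the attempt.
  return { state: "QUERY_PENDING", retryIncrement: true };
}
```

In either branch the TB PENDING hold is untouched. Only a later post_pending or void_pending outbox event releases the funds.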

Detection:

  • PspWorker logs: repeated state_transition events with retryIncrement=true for IPPS rows.
  • BalanceMonitor: balance_skipped or drift alert (if the IPPS API for balance checks is also down).
  • Grafana alert: manual_review_required events.

Recovery:

  1. Wait for IPPS to recover (check the IPPS status page or contact integration@ipps.cloud).
  2. After IPPS recovers, PspWorker automatically resumes processing rows with expired leases: rows in QUERY_PENDING with leased_at < now() - PSP_RETRY_LEASE_SEC are re-picked.
  3. For rows in MANUAL_REVIEW, use the force-resolve admin endpoint (Plan 3):
      • If IPPS actually completed the transfer (check via the IPPS dashboard or support): resolve as success → post_pending outbox event → SETTLED.
      • If IPPS did not receive the transfer: resolve as failed → void_pending outbox event → void TB PENDING → FAILED, funds returned to sender.
  4. After IPPS recovery, new intents process normally.
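
The two force-resolve outcomes in step 3 map one-to-one onto an outbox event and a final intent status. The sketch below records that mapping as data. The planForceResolve name and the ResolutionPlan shape are hypothetical, and the real endpoint's request format is not described in this document.

```typescript
type Resolution = "success" | "failed";

// Hypothetical description of what a force-resolve should enqueue and where
// the intent should end up.
interface ResolutionPlan {
  outboxEvent: "post_pending" | "void_pending";
  finalIntentStatus: "SETTLED" | "FAILED";
  fundsReturnedToSender: boolean;
}

function planForceResolve(resolution: Resolution): ResolutionPlan {
  if (resolution === "success") {
    // IPPS completed the transfer: post the TB pending hold.
    return { outboxEvent: "post_pending", finalIntentStatus: "SETTLED", fundsReturnedToSender: false };
  }
  // IPPS never received the transfer: void the hold and release the funds.
  return { outboxEvent: "void_pending", finalIntentStatus: "FAILED", fundsReturnedToSender: true };
}
```

Ops still has to establish the true IPPS outcome before choosing a resolution. The mapping only fixes what follows from that decision.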

CRITICAL — IPPS confirm is NOT idempotent (Q-IPPS-2, confirmed SIT 2026-04-29): Two confirm calls with the same lookupRef produce two independent real-money transfers. IppsDriver never retries a confirm call. If PM crashes after sending confirm but before saving the confirmRqUid, the row enters MANUAL_REVIEW with reason orphan_lookup_ref — ops must verify with IPPS support whether the transfer went through before resolving. Do NOT attempt to resend a confirm without first checking inquiry.

Prevention:

  • IPPS HTTP timeout is configured via IPPS_HTTP_TIMEOUT_MS (default 8000ms).
  • PSP_MAX_RETRIES=3 limits retry cycles before escalating to MANUAL_REVIEW.
  • PspWorker lease (PSP_LEASE_SEC=10 for the first attempt, PSP_RETRY_LEASE_SEC=30 for retries) prevents concurrent processing of the same row.
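
The lease rule can be sketched as a time comparison: a row may be picked up again only once its lease has expired. The helper names below are hypothetical, and choosing the lease length from the row's retry count is an assumption. The document states only the two durations and the leased_at < now() - lease condition.

```typescript
const PSP_LEASE_SEC = 10; // first attempt
const PSP_RETRY_LEASE_SEC = 30; // retries

// ASSUMPTION: a row with retryCount 0 is on its first attempt.
function leaseSecondsFor(retryCount: number): number {
  return retryCount === 0 ? PSP_LEASE_SEC : PSP_RETRY_LEASE_SEC;
}

// True when leased_at < now - lease, i.e. no worker holds the row any more.
function isLeaseExpired(leasedAt: Date, now: Date, leaseSec: number): boolean {
  return leasedAt.getTime() < now.getTime() - leaseSec * 1000;
}

// A row is safe to re-pick only if its lease for the current attempt has run out.
function canRepick(leasedAt: Date, retryCount: number, now: Date): boolean {
  return isLeaseExpired(leasedAt, now, leaseSecondsFor(retryCount));
}
```

While a lease is live, a second worker sees canRepick return false and skips the row, which is how concurrent processing of the same transfer is avoided.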


Scenario 6: KYC Service is Down

Impact:

  • New KYC submissions are accepted but not processed: Auth Center enqueues a BullMQ job into Redis, and the Redis write succeeds (job queued) even though KYC Service is down. The job remains in the queue.
  • If Redis is also down: the BullMQ enqueue fails and the KYC submission itself fails with 500.
  • In-flight OCR jobs: already-queued jobs remain in BullMQ and are processed when KYC Service restarts. There is no data loss if Redis persistence is enabled.
  • User impact: users who submitted KYC documents during the downtime see a "pending" status with no immediate feedback. If the job queue is lost (Redis with no persistence), users must resubmit.
  • No impact on existing payments: KYC status is checked at intent creation time, not continuously.

Detection:

  • KYC Service container down.
  • BullMQ job queue depth growing (visible in Bull Board or via BullMQ's getJobCounts()).

Recovery:

  1. Restart KYC Service.
  2. BullMQ automatically resumes processing queued jobs.
  3. Users who submitted during extended downtime: if their jobs are confirmed to be in the queue, they process automatically. If the queue was lost, advise resubmission.
  4. Verify OCR results are written to kyc_verification.ocr_result after recovery.


Scenario 7: Notifications Service is Down

Impact:

  • Push notifications are not delivered to users.
  • Redis Stream stream.notifications.jobs accumulates unprocessed messages (the producers, namely PM, Auth Center, and KYC Service, continue writing to the stream normally).
  • No impact on payments, auth, or KYC: the Notifications Service has no write path into business-critical data.

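Detection:

  • Notifications Service container down.
  • Redis stream consumer group notifications-worker falls behind: the stream length (XLEN) and the group's lag (XINFO GROUPS) keep growing. XPENDING counts only messages that were delivered but not yet acknowledged, so it shows entries that were mid-delivery when the service stopped rather than the undelivered backlog.
  • Grafana alert: Notifications Service process down.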

Recovery:

  1. Restart Notifications Service.
  2. The consumer resumes from XREADGROUP GROUP notifications-worker ... '>' and processes all unread messages.
  3. XAUTOCLAIM (30s idle threshold, runs on startup and periodically) handles any messages that were mid-delivery when the service went down.
  4. Some notifications may be significantly delayed (minutes to hours depending on backlog) but are not lost (assuming Redis persistence is active).
  5. If the stream backlog is very large (hours of outage), consider trimming with XTRIM after ensuring no critical messages need delivery, or let it drain naturally.

Note: invalid-registration-token FCM errors cause the Notifications Service to DELETE the corresponding row from public.device_token. This is correct behavior — do not interfere with this cleanup.


Scenario 8: nginx is Down

Impact:

  • Flutter app loses all connectivity: all RPC calls and API requests fail.
  • No payments, no auth, no KYC.
  • Internal services (PM, Serverpod, notifications) continue running and can complete in-flight work.

Detection:

  • Flutter app gets connection refused or DNS failure.
  • External health check on port 443 fails.

Recovery:

  1. Restart the nginx container.
  2. Verify TLS certificates are still valid and loaded correctly.
  3. Verify auth_request to the Serverpod health endpoint responds before accepting traffic.
  4. No data loss: nginx is stateless.


General Recovery Checklist

After any service recovery (regardless of which service):

  • Check pm.intent for rows stuck in non-terminal states (CREATED, VALIDATED, AUTHORIZED).
  • Verify transit.balance = 0 in TigerBeetle: system.transit.INTERNAL_P2P.THB, system.transit.IPPS.THB, system.transit.MERCHANT.THB must all have credits_posted - debits_posted == 0 after all pending transfers settle.
  • Check pm.psp_tx_map for rows in MANUAL_REVIEW or stuck in QUERY_PENDING/CONFIRM_PENDING/INQUIRING with high retry_count.
  • Check pm.outbox_event for status='pending' rows with high retry_count — these indicate repeated TB or DB failures.
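  • Check Redis Stream stream.notifications.jobs for backlog: XLEN stream.notifications.jobs for the total stream length, and XPENDING stream.notifications.jobs notifications-worker - + 999 for messages delivered but not yet acknowledged.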
  • Review PM error logs with intent_id (= trace_id) for any data inconsistencies.
  • To rerun the PM startup reconciler, restart PM: the reconciler runs automatically on npm start, before app.listen().
  • Verify BalanceMonitor reports balance_ok (not balance_drift or low_partner_balance) after IPPS-related incidents.