Case Study · 05

MonkeyTilt · Gaming Platform Performance Optimization

Took a real-time gaming platform from crawling under load to a fast, properly instrumented backend. Optimized queries, restructured the business-logic layer, and automated deployments so the team could ship without holding their breath.

Role: Senior Backend Engineer & Performance Lead
Focus areas: Database optimization, real-time backend performance, deploy automation
Engagement: 5 months · optimization & tooling, embedded with a 3-engineer platform team
Stack: Postgres · Redis · Node · TypeScript · Terraform · AWS
01

Context

MonkeyTilt is a real-time gaming platform where latency and correctness aren't quality metrics — they're the product. A 200ms delay in wallet resolution is a user complaint. A double-credit is a financial error.

The platform had grown fast. The team was talented but had been in feature mode for a year and a half. The database was carrying more than it should, the business logic layer had grown inconsistently, and deploys required a human watching metrics for 20 minutes before anyone felt safe.

The engagement started with a performance audit and grew into a broader platform hardening effort. The team wanted to ship faster and sleep better.

02

Problem

A real-time gaming backend where the wallet table was a hotspot, query performance was unmeasured, and deploys were manual and nerve-racking.

The wallet table was the center of the problem. High-frequency concurrent updates to a small set of rows created lock contention that cascaded into latency spikes visible to players. The query layer had no SLOs, so nobody knew exactly how bad it was until players noticed.

Deploys were manual: someone pushed, then watched the dashboard, then either declared it safe or rolled back. There was no automated gate, no canary, and no latency-based rollback criterion.

Why it needed to be done

In real-time gaming, latency and correctness failures are product failures.

Risk surface

The technical debt wasn't abstract. It was directly visible to players and directly correlated with churn.

Wallet lock contention causing player-visible latency

Hot-row contention on the wallet table was causing p95 latency spikes during peak sessions. Players experienced visible delays on the most time-sensitive action in the game.

Deploys requiring manual oversight for 20+ minutes

Every deploy was a 20-minute manual watch window. At the team's ship frequency, this was consuming engineering time and creating anxiety that slowed the release cadence.

No latency SLO meant no regression detection

Without baseline metrics and a defined SLO, performance regressions were only discovered by players. The team had no system to catch them earlier.

Solution

What was built and how it fits together.

01 · Latency baseline and SLO definition
Instrumented p50, p95, and p99 latency across the hot paths before touching any code. Defined SLOs per endpoint class, giving the team a shared definition of 'acceptable' for the first time.
02 · Targeted query and index pass
EXPLAIN ANALYZE on the top 20 slow queries. Index additions and rewrites where the plan changed. Composite indexes for the multi-column predicates the ORM was generating, which had been running as full scans.
03 · Wallet hot-row redesign
Replaced the single-row wallet update pattern with a balance-ledger model: each transaction appends a row, and the current balance is a projection. Lock contention dropped to near zero.
04 · Cached projections
Balance and leaderboard projections are materialized in Redis on write. Reads hit the cache; the database is updated asynchronously. Cache invalidation is event-driven and audited.
05 · Latency-gated canary deploys
A deploy pipeline that routes 10% of traffic to the new version, measures p95 latency against the SLO for 5 minutes, and either promotes or rolls back automatically. No human in the loop unless the gate fails.
06 · Infrastructure as code
Terraform covering the full production footprint: ECS services, RDS, Redis, ALBs, and autoscaling policies. New environments from a single pipeline; config drift from the past three years eliminated.
Key technical work

The pieces of the build that mattered most.

01

Latency baseline and SLO

OpenTelemetry instrumentation across all hot paths, with dashboards per endpoint class. SLOs defined at p95 for wallet, game-state, and leaderboard endpoints before any changes were made.

OpenTelemetry · Grafana · SLO definition
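
Roughly what this looks like in practice, as a minimal sketch: an Express-style middleware that records a duration histogram tagged with the endpoint class. Metric and attribute names here are illustrative, and the OpenTelemetry Node SDK (MeterProvider, exporter) is assumed to be initialized at process start.

```typescript
// Minimal sketch: per-request latency histogram, tagged by endpoint class.
// Assumes the OpenTelemetry Node SDK is initialized elsewhere at startup;
// metric and attribute names are illustrative.
import { metrics } from "@opentelemetry/api";
import type { NextFunction, Request, Response } from "express";

const meter = metrics.getMeter("monkeytilt-api");

// p50/p95/p99 are derived from the exported histogram buckets in Grafana.
const requestDuration = meter.createHistogram("http.server.duration", {
  unit: "ms",
  description: "Request duration per endpoint class",
});

export function latencyMiddleware(endpointClass: string) {
  return (req: Request, res: Response, next: NextFunction) => {
    const start = process.hrtime.bigint();
    res.on("finish", () => {
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      requestDuration.record(elapsedMs, {
        "endpoint.class": endpointClass, // wallet | game-state | leaderboard
        "http.method": req.method,
        "http.status_code": res.statusCode,
      });
    });
    next();
  };
}

// Usage: app.use("/wallet", latencyMiddleware("wallet"), walletRouter);
```
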
02

Targeted query and index pass

EXPLAIN ANALYZE on the top 20 queries by total database time. Composite index additions for multi-column predicates; query rewrites where the ORM was producing suboptimal plans.

EXPLAIN ANALYZE · Composite indexes · Query rewrite
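
A sketch of that workflow using node-postgres. The table and column names (wallet_transactions, player_id, created_at) are illustrative, not the production schema:

```typescript
// Illustrative index-pass workflow with node-postgres; schema names are
// assumptions, not the production ones.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Step 1: inspect the plan for a hot multi-column predicate. Before the
// pass, queries of this shape were planning as sequential scans.
export async function explainHotQuery(playerId: string): Promise<void> {
  const { rows } = await pool.query(
    `EXPLAIN (ANALYZE, BUFFERS)
     SELECT * FROM wallet_transactions
     WHERE player_id = $1 AND created_at > now() - interval '1 hour'`,
    [playerId]
  );
  for (const row of rows) console.log(row["QUERY PLAN"]);
}

// Step 2: the fix for that predicate shape, equality column first, then
// the range column. CONCURRENTLY avoids blocking writes during the build.
export const migration = `
  CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_wallet_tx_player_created
    ON wallet_transactions (player_id, created_at);
`;
```
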
03

Wallet hot-row redesign

Migrated from a single-row mutable balance to a ledger model: immutable append-only transaction rows, balance derived as a projection. Lock contention eliminated; audit trail added for free.

Ledger model · Append-only · Postgres
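
A minimal sketch of the pattern, with hypothetical schema names. The UNIQUE reference column is an assumption, included because it is what makes retries idempotent and rules out the double-credit failure mode mentioned earlier:

```typescript
// Ledger-model sketch: wallet changes are INSERTs, never UPDATEs, so
// concurrent transactions don't contend on a hot row. Schema names are
// hypothetical; assumes wallet_ledger.reference has a UNIQUE constraint.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function credit(playerId: string, amountCents: number, reference: string) {
  // ON CONFLICT makes retries idempotent: replaying the same reference
  // can never double-credit.
  await pool.query(
    `INSERT INTO wallet_ledger (player_id, amount_cents, reference)
     VALUES ($1, $2, $3)
     ON CONFLICT (reference) DO NOTHING`,
    [playerId, amountCents, reference]
  );
}

// Balance is a projection over the ledger. The hot path serves this from
// the Redis projection (next item); this SQL form is the authoritative
// fallback, and the ledger rows double as the audit trail.
export async function balance(playerId: string): Promise<number> {
  const { rows } = await pool.query(
    `SELECT COALESCE(SUM(amount_cents), 0) AS balance
     FROM wallet_ledger WHERE player_id = $1`,
    [playerId]
  );
  return Number(rows[0].balance);
}
```
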
04

Cached projections with Redis

Balance and leaderboard projections written to Redis on each transaction commit. Cache reads serve the hot path; async Postgres updates keep the authoritative record current.

Redis · Write-through cache · Async updates
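
A sketch of the write-time projection using node-redis; key names are illustrative:

```typescript
// Write-time projection sketch with node-redis; key names are illustrative.
// Assumes `await redis.connect()` has run once at startup.
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });

// Called after the ledger INSERT commits. INCRBY keeps the cached balance
// in sync without re-reading the ledger; a sorted set carries the leaderboard.
export async function projectTransaction(playerId: string, amountCents: number) {
  await redis
    .multi()
    .incrBy(`balance:${playerId}`, amountCents)
    .zIncrBy("leaderboard:net", amountCents, playerId)
    .exec();
}

// Hot-path read: serve from the projection, fall back to Postgres only on
// a cache miss (null).
export async function cachedBalance(playerId: string): Promise<number | null> {
  const value = await redis.get(`balance:${playerId}`);
  return value === null ? null : Number(value);
}
```
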
05

Latency-gated canary pipeline

GitHub Actions pipeline with a canary step: 10% traffic split, 5-minute observation window, automated SLO check, promote or rollback. Zero manual steps for a clean deploy.

GitHub Actions · Canary deploy · SLO gate
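
The workflow itself is YAML; the piece that makes it a latency gate is a check along these lines, run after the observation window. The Prometheus endpoint, metric name, and query here are assumptions for illustration:

```typescript
// Illustrative SLO gate run by the canary step after its 5-minute window.
// The Prometheus URL, metric name, and query are assumptions. Exit code 0
// promotes the canary; non-zero triggers the automated rollback.
const PROM_URL = process.env.PROM_URL ?? "http://prometheus:9090";
const P95_SLO_MS = Number(process.env.P95_SLO_MS ?? "150");

async function canaryP95(): Promise<number> {
  const query =
    'histogram_quantile(0.95, sum(rate(http_server_duration_ms_bucket{deployment="canary"}[5m])) by (le))';
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = (await res.json()) as any;
  // No data counts as a failure: never promote blind.
  return Number(body.data.result[0]?.value[1] ?? Infinity);
}

canaryP95().then((p95) => {
  console.log(`canary p95=${p95}ms, SLO=${P95_SLO_MS}ms`);
  process.exit(p95 <= P95_SLO_MS ? 0 : 1);
});
```
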
06

Infrastructure as code

Terraform for the full production footprint. Eliminated three years of console-click drift. New environments reproducible in a single pipeline run.

Terraform · ECS · RDS
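
The same shape, sketched in CDK for Terraform (cdktf) rather than the plain HCL the engagement actually used; resource names are illustrative:

```typescript
// cdktf sketch of the IaC shape. The engagement used plain Terraform HCL;
// this is the TypeScript equivalent, with illustrative resource names.
import { Construct } from "constructs";
import { App, TerraformStack } from "cdktf";
import { AwsProvider } from "@cdktf/provider-aws/lib/provider";
import { EcsCluster } from "@cdktf/provider-aws/lib/ecs-cluster";

class PlatformStack extends TerraformStack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    new AwsProvider(this, "aws", { region: "us-east-1" });
    new EcsCluster(this, "platform", { name: `monkeytilt-${id}` });
    // ECS services, RDS, ElastiCache (Redis), ALBs, and autoscaling
    // policies follow the same pattern: every production resource is
    // declared here, none created in the console.
  }
}

// One stack definition, any number of reproducible environments.
const app = new App();
new PlatformStack(app, "production");
app.synth();
```
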
Business impact

What came out of it.

p50 query time: –64% (placeholder)
Median query latency across wallet and game-state paths after the index pass and ledger redesign.

p95 query time: –71% (placeholder)
95th percentile latency, the number that was causing player-visible spikes. Measured over the same traffic window as the baseline.

Deploy time: –80% (placeholder)
Time from merge to safe production deploy, including the canary window. Down from 20+ minutes of manual watching to a fully automated pipeline.

Error rate: 0.04% (placeholder)
Post-optimization steady-state error rate on the wallet and game-state paths. SLO target was under 0.1%.

Values marked placeholder are representative — replace with measured numbers from the live system once available.

Final result

A real-time backend the team can change without flinching.

MonkeyTilt's hot path is faster, calmer, and instrumented end-to-end. The wallet stops being the limiter; the database stops being the mystery; the deploy pipeline catches perf regressions before customers do. The team ships at higher frequency with less anxiety than before the engagement started.

p95 wallet latency reduced 71% via ledger redesign
Automated canary pipeline replacing 20-minute manual watches
Latency SLOs defined and monitored per endpoint class
Redis projection cache eliminating hot read paths
Full Terraform coverage eliminating infrastructure drift
Next engagement

Have a similar system to build or optimize?

If you have a real-time backend with latency or correctness problems, send a few sentences. I'll respond directly within one business day.

Book a call · bilalasharf@gmail.com