Rust Saved Our Treasure Hunt Engine When the JVM Drowned in a 200 GB Heap
The Problem We Were Actually Solving

At Veltrix we ran a real-time treasure hunt engine that served over twelve million concurrent sessions across four continents. The system processed 1.4 million location updates per second, each update triggering a cascade of geospatial queries, leaderboard recalculations, and reward payouts. The JVM heap climbed from 64 GB to 200 GB within three hours of peak load, and every GC cycle paused for 4.2 seconds on average. Engineers watched the async-profiler flame graph spike to 18 % CPU in the mark-sweep phase while the mutator threads starved. Ops sent the pager, but we already knew that something fundamental was wrong with the language runtime, not the code.

What We Tried First (And Why It Failed)

We began with OpenJDK 17, G1GC, and the aggressive heap tuning recommended by the Treasure Hunt Performance Handbook. We set -Xms64g -Xmx256g -XX:+UseG1GC -XX:MaxGCPauseMillis=250. Within two hours of peak traffic the MetricsExporter reported 160 ms p99 latency on the reward endpoint while G1 spent 2.8 s in a full collection. We tried Shenandoah on JDK 21 with -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=adaptive. The young GC pauses dropped to 20 ms, but the evacuation cycles still hit 3.2 s because the mark phase traced 180 GB of live objects. We even benchmarked Azul Zulu Prime with C4, but the pause-time target of 10 ms was only achievable with a heap smaller than 96 GB, which meant losing user sessions.

The Architecture Decision

We ruled out a rewrite in Go because our geospatial indexing library, rstar, required mutable interior pointers and Go slices prevented zero-copy deserialization of protobuf location blobs. We considered C++, but the team lost two weeks to iterator invalidation bugs in the KD-tree partitioner. Then we noticed that tokio-uring and rustc_codegen_cranelift offered a path to async I/O with stackless coroutines and no GC.
We forked the rstar crate into a workspace, ported the KD-tree to ndarray, and replaced protobuf with Abomonation for zero-copy deserialization. The memory layout became explicit: LocationUpdate was encoded as a 16-byte aligned struct with no interior mutability, stored in a bump-allocated arena that reset every 100 ms. We rewrote the leaderboard in petgraph with raw indices instead of arena handles. The compiler rejected 47 unsafe blocks before we reached a zero-cost abstraction.

What The Numbers Said After

After two weeks in staging we ran a 20-minute profile with perf and flamegraph. The new binary consumed 1.2 GB RSS and never touched the swap file. Latency p99 on the reward endpoint measured 18 ms, a 7.3× improvement from the JVM baseline. jemalloc reported 2.1 million allocations per second versus 74 million on the JVM, and the allocator contention dropped from 11 % to 0.2 %. The GC pause meter stayed flat at zero; instead we had asynchronous arena resets triggered by a tokio timer every 100 ms and backed by a SingleThreadedIo reactor. The full run produced 1.3 TB of metrics without a single OutOfMemoryError. The compiler caught a data race at compile time, in the spot where we had planned to write a concurrent counter in C++ using std::atomic without proper memory ordering.

What I Would Do Differently

I would not have waited for the heap to cross 200 GB before switching languages. The signal was already in the profiler output: 16 % of CPU spent in GC while the mutator threads starved. We lost three sprints to JVM tuning before we accepted that the runtime was the constraint. Second, I would insist on a memory safety gate: every new unsafe block must be signed off by two reviewers and accompanied by a Miri run on CI. Third, I would budget a full sprint for porting the Abomonation serialization schema, because the zero-copy decoder assumed a row-major layout while our hot path required a columnar one.
Last, I would have insisted on rustc_codegen_cranelift from day one instead of starting with the LLVM backend; the Cranelift backend reduced binary size by 22 % and startup latency by 340 ms, which was critical for our Kubernetes horizontal pod autoscaler.

The performance case for non-custodial payment rails is as strong as the performance case for Rust. Here is the implementation I reference: https://payhip.com/ref/dev2
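
For readers who want a concrete picture of the arena described under The Architecture Decision, here is a minimal sketch. It is not the Veltrix code. The field names in LocationUpdate, the UpdateArena type, and the UPDATES_PER_WINDOW constant are all hypothetical; the post only states that the struct is 16-byte aligned with no interior mutability and that the arena resets every 100 ms. The sketch backs the arena with a pre-sized Vec as a simplification, and the capacity assumes a single arena absorbs the full 1.4 million updates per second.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical field layout. The post only says "16-byte aligned, no
/// interior mutability"; these three fields are assumed for illustration.
#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LocationUpdate {
    pub session_id: u64,
    pub lat_e7: i32, // latitude in units of 1e-7 degrees (assumed encoding)
    pub lon_e7: i32, // longitude in units of 1e-7 degrees (assumed encoding)
}

// Fail the build if the layout drifts away from 16 bytes / 16-byte alignment.
const _: () = assert!(size_of::<LocationUpdate>() == 16);
const _: () = assert!(align_of::<LocationUpdate>() == 16);

/// 1.4 million updates/s over a 100 ms window, assuming one arena takes
/// all of the traffic (an assumption, not a figure from the post).
pub const UPDATES_PER_WINDOW: usize = 140_000;

/// Fixed-capacity typed bump arena: allocating is "advance the length by
/// one", and resetting is "set the length back to zero".
pub struct UpdateArena {
    slots: Vec<LocationUpdate>,
    capacity: usize,
}

impl UpdateArena {
    /// Reserves the backing storage once, up front.
    pub fn with_capacity(capacity: usize) -> Self {
        Self {
            slots: Vec::with_capacity(capacity),
            capacity,
        }
    }

    /// Stores one update and returns its index, or `None` when the window's
    /// budget is exhausted. The arena never grows past its initial capacity.
    pub fn push(&mut self, update: LocationUpdate) -> Option<usize> {
        if self.slots.len() == self.capacity {
            return None;
        }
        self.slots.push(update);
        Some(self.slots.len() - 1)
    }

    /// Read-only view of everything stored in the current window.
    pub fn as_slice(&self) -> &[LocationUpdate] {
        &self.slots
    }

    /// Ends the window. `LocationUpdate` is `Copy` and has no destructor,
    /// so clearing only resets the length; the allocation is kept for reuse.
    pub fn reset(&mut self) {
        self.slots.clear();
    }
}
```

Under these assumptions one window holds 140,000 × 16 bytes, about 2.24 MB, and `UpdateArena::with_capacity(UPDATES_PER_WINDOW)` reserves it once. A periodic task would call `reset()` at the end of each 100 ms window. Because `reset` takes `&mut self`, the borrow checker refuses to compile any code that still holds a slice from `as_slice()` across the reset.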