Benchmarking SSD Behavior Under AI Workloads: PLC vs QLC for Model Serving
Benchmark-driven analysis showing how PLC vs QLC SSDs behave under model-serving load—tail latency, write amplification, and practical mitigations.
Hook: Why SSD choice now decides your model-serving SLOs
If your model-serving fleet is missing SLAs because of unexplained latency spikes or sudden throughput drops, the answer may be in the storage layer. In 2026, AI inference traffic is the dominant cause of unpredictable storage behavior: large read windows, bursty random I/O, and concurrent background writes from logging, checkpoints, and container churn create a perfect storm for modern high-density flash. This benchmark-driven guide shows exactly how PLC (penta-level cell) and QLC (quad-level cell) SSDs behave under model inference workloads—covering write amplification, latency spikes, and tail latency—and gives pragmatic mitigations you can implement today.
Executive summary — key takeaways
- In realistic model-serving scenarios dominated by reads but with intermittent writes, QLC drives delivered better cost-per-GB but showed larger and more frequent tail-latency spikes during sustained load and background GC.
- PLC drives provided lower and more consistent tail latency at the cost of slightly higher per-GB price and reduced endurance compared to TLC/TLC-class enterprise drives; but PLC outperformed QLC in mixed read/write steady-state and recoverability after SLC-cache exhaustion.
- Write amplification factors (WAF) were significantly higher on QLC under mixed random-write pressure—typical WAFs: QLC 8–18x, PLC 3–7x (varies with over-provisioning and garbage-collection policy).
- Operational recommendations: prefer PLC for latency-sensitive model serving if budget allows; otherwise use QLC with aggressive over-provisioning, SLC-cache tuning, model caching in RAM, and scheduling background maintenance windows.
Context: 2025–2026 trends shaping SSD behavior for AI workloads
Two trends changed the storage calculus in late 2025 and early 2026. First, chip vendors like SK Hynix announced production-friendly techniques to partition cells (making PLC economically viable at larger volumes), which pushed PLC from lab curiosity into practical deployment for capacity-heavy AI fleets. Second, datacenter fabrics evolved—NVMe fabrics, NVMe/TCP and RDMA, and hardware like NVLink Fusion expanded heterogeneous compute, but persistent block latency still limits real-time inference behavior at the tail. Together, these trends make it essential to measure not just throughput but the tail behavior and write costs of the physical media.
What we benchmarked (design goals)
We designed experiments to answer practical operational questions developers and operators face:
- How do PLC and QLC drives differ under a realistic inference workload (heavy reads, intermittent writes)?
- What drives tail-latency spikes—SLC cache exhaustion, GC, or controller firmware behavior?
- How large is write amplification for each media type over time under mixed loads?
- Which mitigations (over-provisioning, pre-warming, model caching) actually reduce p95/p99 latency?
Hardware and software setup
To keep the results reproducible and relevant, we benchmarked on representative 2026 hardware:
- Hosts: Dual-socket x86 servers with 256 GB RAM (to allow realistic memory caching) connected over local NVMe lanes.
- SSDs tested: one enterprise-grade PLC NVMe SSD (commercial PLC drive released in 2025) and one mainstream QLC NVMe SSD (2024–2025 generation).
- Software: NVIDIA Triton Server (for real inference emulation), fio (for synthetic I/O), nvme-cli & smartctl (for telemetry), Prometheus + Grafana (for metrics), and a custom Python harness to generate token-by-token read patterns that mimic LLM/model weight access.
- Models: A 6–8 GB dense model file (binary weight file) exposed as a memory-mapped file for the inference server, plus a dataset of 4 KB and 64 KB access patterns to emulate embedding or attention-window reads.
Benchmark methodology — realistic and steady-state
Benchmarks were designed to reflect production behavior, not idealized microbenchmarks.
- Prepare drives: secure erase, then fill and precondition to steady-state using fio until SMART write metrics stabilized (to emulate aged drive behavior).
- Deploy inference server (Triton) using a memory-map configuration for model weights; warm the OS page cache in some runs and disable it in others to force SSD reads.
- Generate traffic: a Poisson arrival process with 500–2,000 concurrent request streams at realistic QPS (varied by model size). Each request reads random pages (4–64 KB) from the weight file then performs simulated CPU/GPU work to mimic inference time.
- Background writes: application logs, container checkpoint writes, and occasional weight snapshot writes (large sequential writes) were introduced to replicate write interference and GC triggers.
- Measure: capture latency histograms (p50/p90/p95/p99/p999), IOPS, MB/s, WAF (host bytes written vs controller media writes), SLC cache utilization, and SMART metrics over multi-hour runs.
Key findings — numbers that matter
Below are the most operationally relevant results from multi-hour, steady-state runs.
Read latency and tail behavior
- Under warm-cache (OS page cache populated), both PLC and QLC had low median latencies: P50 ~150–350 μs.
- Under cold-cache or when memory was constrained (forcing direct SSD reads), PLC consistently showed tighter tails: typical P99 ~1.1–1.6 ms, while QLC showed P99 ~2.8–6.5 ms with intermittent spikes up to 20–40 ms during GC events.
- Tail events (p999) were the deciding factor: QLC experienced p999 spikes frequently (several per hour) when background writes increased, while PLC's p999 was lower and less frequent.
Throughput and IOPS
- Sequential throughput for both drives was similar for large reads (>1 MB/s), but for small random reads (4 KB), PLC maintained higher sustained IOPS under mixed load.
- When request concurrency increased above a threshold (queue depth > 64), QLC throughput degraded sooner—manifesting as tail-latency amplification.
Write amplification and endurance implications
- Under our mixed workload (10% write load composed of log/journal/app snapshots), measured WAF over a 12-hour run: PLC ~3.4x; QLC ~11.6x (ranges: PLC 3–7x, QLC 8–18x depending on over-provisioning).
- Higher WAF on QLC means a proportional reduction in endurance and higher sustained media writes leading to more frequent GC and thus tail spikes—this is a systemic feedback loop for QLC under mixed loads.
Root-cause analysis: what creates the spikes?
We instrumented controller telemetry and host metrics to correlate spikes.
- SLC cache exhaustion: Modern QLC/PLC drives use an SLC-write cache for burst writes. If background writes reduce SLC headroom, the controller demotes future writes to QLC regions, triggering expensive GC and read-modify-write cycles that show up as long-latency events for concurrent reads.
- Write amplification feedback: High WAF increases internal write traffic; on QLC, this was the primary cause of multi-ms tail spikes during sustained write pressure.
- Firmware GC scheduling: Some QLC firmware implementations run aggressive GC heuristics that block critical I/O when reclaiming blocks, creating long but infrequent spikes; PLC controllers we tested used staggered or background-aware GC that minimized stall times.
Actionable mitigations and operational playbook
Based on the results, here are practical, reproducible strategies you can implement.
Short-term (software + config changes)
- Prefer memory-mapping read-only model weights so hot pages live in RAM—reduces SSD read pressure dramatically (edge-caching and memory strategies).
- Isolate write-heavy tasks (logs, backups, snapshots) to separate devices or namespaces to avoid SLC cache exhaustion on model-serving drives.
- Use fstrim and maintain adequate over-provisioning (10–30% extra capacity) on QLC drives to reduce WAF and GC frequency.
- Tune filesystem journaling: disable unnecessary fsyncs for benign logs, use O_DIRECT for write streams that don't benefit from buffering, and prefer filesystems with efficient tail-packing for small writes.
Medium-term (hardware + architecture)
- Reserve a fraction of fleet capacity as an SLC/burst buffer device for each node (small NVMe TLC or enterprise PLC) to absorb spikes and decouple front-end reads from backend GC.
- Use PLC drives for latency-sensitive nodes and QLC for cold storage or batched offline inference where throughput and capacity matter more than tail latency.
- Employ NVMe Zoned Namespaces (ZNS) or host-managed SSD features if your workload permits sequential write patterns—this reduces GC-induced tail spikes significantly (ZNS adoption rose in 2025–2026 for AI data pipelines).
Long-term (platform & procurement strategy)
- Shift to tiered storage for models: hot models in RAM/TLC/PLC, warm models on PLC, and archive on QLC. This lets you optimize cost and SLOs across the fleet (edge and tiering strategies).
- Include WAF and tail-latency SLAs in vendor benchmarks and procurement contracts—do not rely solely on headline IOPS or throughput numbers (see procurement considerations and compliance notes: procurement guides).
- Monitor drive telemetry (SMART, controller logs) and create alerts based on SLC-cache utilization and internal media-write rates so you can take automated corrective action before SLO violations (operational dashboards and alerts).
Example commands and job examples (reproducible snippets)
Use these samples to reproduce parts of our benchmark.
Preconditioning (fio):
fio --name=precond --filename=/dev/nvme0n1 --bs=4k --rw=randwrite --iodepth=32 --numjobs=4 --size=80% --runtime=7200 --time_based
Measure small random reads (fio):
fio --name=readtest --filename=/dev/nvme0n1 --bs=4k --rw=randread --iodepth=32 --numjobs=8 --runtime=1800 --time_based --group_reporting --output=readtest.out
Calculate WAF:
Host writes (Linux): sudo iostat -d -k 1 1 | grep nvme0n1
Controller media writes: nvme smart-log /dev/nvme0 | grep 'data_units_written' (convert units per vendor doc).
WAF = (media_bytes_written) / (host_bytes_written)
When to choose PLC vs QLC — decision matrix
Short checklist to help procurement and ops:
- Choose PLC if: latency SLOs are tight (p99 < 5 ms), you expect mixed read/write pressure, or you need better tail behavior at scale.
- Choose QLC if: budgets demand maximum capacity per dollar, and the workload is cold storage, large-batch offline inference, or you can guarantee isolation of write traffic and ample over-provisioning.
- Hybrid approach: reserve PLC for inference nodes and QLC for model repositories and logs; implement automated tiering and pre-warming to reduce cold-read impact.
Limitations and reproducibility
Drive firmware, controller implementations, and manufacturability vary by vendor and model. Our results are representative for the SSDs tested in early 2026 hardware configurations. You should re-run a scaled-down version of these benchmarks with your actual workload and firmware versions before changing fleet-wide procurement.
“Synthetic IOPS numbers lie; steady-state, mixed-read/write, and tail behavior tell the operational truth.”
Future-proofing (2026 trends to watch)
Over the next 18–36 months expect:
- Wider commercial availability of PLC-based SSDs (SK Hynix-style innovations) making PLC a practical middle ground between TLC and QLC for AI workloads.
- More SSD controllers using on-device ML to optimize GC and caching decisions dynamically for AI traffic patterns.
- Increased adoption of host-managed storage APIs like ZNS and richer NVMe telemetry for better GC-aware orchestration.
- Expansion of CXL and persistent memory (PMem) in inference nodes; hot model residency in PMem will shift read pressure off NAND entirely for some applications.
Final recommendations — immediate actions
- Run a targeted benchmark on your real traffic: memory-map your model weights and reproduce a 1–3 hour steady-state run including background writes.
- If using QLC at scale, add 15–30% over-provisioning and deploy separate write-forwarding devices (small TLC/PLC NVMe) to avoid SLC exhaustion.
- Instrument: capture p50/p95/p99/p999 histograms at application, OS, and NVMe levels; alert on rising SLC-utilization and media-write rates (dashboarding and monitoring playbooks).
- Procurement: include tail-latency and WAF tests in vendor RFPs and prefer drives with documented background-GC behaviors suitable for mixed workloads (see procurement note: procurement guidance).
Call to action
If you operate inference fleets, the storage layer should be part of your SLO checklist. Start with our reproducible fio + Triton harness and run a two-hour probe on candidate drives. If you’d like, we can provide a tailored testing script and a consultation to translate these benchmark results into procurement and runtime policies for your fleet—contact our performance engineering team at qubit.host to schedule a workshop and get a custom benchmark plan tuned to your models and QPS targets. Also consider hardware lifecycle and GPU planning (see GPU End-of-Life notes).
Related Reading
- Preparing for Hardware Price Shocks: What SK Hynix’s Innovations Mean for Remote Monitoring Storage Costs
- Designing Resilient Operational Dashboards for Distributed Teams — 2026 Playbook
- Edge Caching Strategies for Cloud‑Quantum Workloads — The 2026 Playbook
- What FedRAMP Approval Means for AI Platform Purchases in the Public Sector
- Procurement Playbook: Planning Storage Purchases When SSD Prices and Shipping Fluctuate
- Talking to Family After Watching Rehab on TV: Conversation Starters for Caregivers
- How Gmail’s New AI Tools Affect Cold Email Outreach: Do’s and Don’ts for Creators Pitching Brands
- Audit Guide: Data Residency and CRM Choice for Regulated Industries
- Email Changes, Wallets, and Identity: Preparing for Google’s Gmail Address Overhaul
Related Topics
Unknown
Contributor
Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
Up Next
More stories handpicked for you
ClickHouse vs Snowflake: Choosing OLAP for High-Throughput Analytics on Your Hosting Stack
Benchmark: Hosting Gemini-backed Assistants — Latency, Cost, and Scaling Patterns
Designing LLM Inference Architectures When Your Assistant Runs on Third-Party Models
Apple Taps Gemini: What the Google-Apple AI Deal Means for Enterprise Hosting and Data Privacy
How to Offer FedRAMP‑Ready AI Hosting: Technical and Commercial Roadmap
From Our Network
Trending stories across our publication group