KV Cache Memory Size - Search News

Nvidia says it can shrink LLM memory 20x without changing model weights

Nvidia's KV Cache Transform Coding (KVTC) compresses LLM key-value cache by 20x without model changes, cutting GPU memory costs and time-to-first-token by up to 8x for multi-turn AI applications.

14d

New KV cache compaction technique cuts LLM memory 50x without accuracy loss

MIT researchers developed Attention Matching, a KV cache compaction technique that compresses LLM memory by 50x in seconds — ...

TMCnet

Penguin Solutions Introduces Industry's First Production-Ready CXL-Based KV Cache Server

Accelerating memory-dependent AI processes, Penguin's MemoryAI KV cache server increases memory capacity by integrating 3 TB of DDR5 main memory and up to eight 1 TB CXL Add-in Cards (AICs). Penguin ...

MarketWatch

XConn Technologies and MemVerge Demonstrate CXL Memory Pool for KV Cache using NVIDIA Dynamo for breakthrough AI workload performance at 2025 OCP Global Summit

Penguin Solutions’ OriginAI Factory Platform Delivers Optimized Performance for AI Inference

Mamba 3, a state space model and an alternative to transformers

Dynamic KV Cache Scheduling in Heterogeneous Memory Systems for LLM Inference (Rensselaer Polytechnic Institute, IBM)