Simulating vLLM's PagedAttention

A simulation of PagedAttention, implemented in Python

Piyush Choudhari
6 min read
AIML
blog
PagedAttention
KVCache
vLLM

Why?

Anthropic recently released an article, Code execution with MCP: Building more efficient agents, which focuses heavily on token conservation and on how agents can use sandboxed environments to execute code and fetch results.

The idea of token/context management stuck with me, and I went down the rabbit hole of token management and optimization, which led me to the vLLM inference runtime. vLLM employs a ton of methods to optimize inference, but I found PagedAttention particularly interesting, so I decided to simulate it.

The Problems

LLM serving is now one of the most crucial components of modern applications, yet it still faces plenty of challenges around cost, performance, and resource management.

Common challenges:

  • High computational load
  • Poor GPU utilization
  • Latency issues
  • Inefficient memory management:
    1. Internal fragmentation
    2. Reservation waste
    3. External fragmentation
  • Complexity of distributed inference

The vLLM inference runtime aims to resolve these challenges, and one of its core components is PagedAttention.

PagedAttention

In LLMs, the KV cache plays an important role during inference. Because LLMs are autoregressive, tokens are generated one by one, and the model stores the key and value tensors from previous steps instead of recomputing them. This makes future token prediction faster, since the model can reuse past computations.
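To make the reuse concrete, here is a tiny, self-contained sketch of an autoregressive decode loop. Everything model-related is faked (fake_kv and the next-token rule are hypothetical stand-ins, not a real model API); the point is only that each step appends one new K/V entry to the cache instead of reprocessing the whole prefix.

def fake_kv(token_id):
    # Hypothetical stand-in for the real key/value projections of a layer.
    return (token_id * 2, token_id * 3)

def decode(prompt, steps):
    kv_cache = [fake_kv(t) for t in prompt]   # prefill: cache K/V for the prompt
    token = prompt[-1]
    for _ in range(steps):
        token = (token + 1) % 100             # dummy "model" picks the next token
        kv_cache.append(fake_kv(token))       # only the new token is processed;
                                              # earlier K/V entries are reused as-is
    return kv_cache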

Early implementations followed a memory allocation strategy borrowed from traditional deep learning workloads. Each inference request received a large, contiguous block of GPU memory reserved for its KV cache. The problem was that sequence lengths during inference are unpredictable. Some inputs end quickly, others expand due to long responses or continued user interaction. Since the system could not know the exact length ahead of time, it had to reserve more memory than necessary.

This led to several inefficiencies:

  • Most of the allocated memory remained unused, contributing to internal fragmentation.

  • Gaps appeared between allocated memory regions, creating external fragmentation.

  • Memory was often blocked for potential future tokens but never consumed.

Despite reserving large chunks of GPU memory, real usage often hovered between 20 and 40 percent. As a result, throughput suffered because the model could serve fewer simultaneous requests before exhausting memory. The hardware was capable of more, but the memory layout became the bottleneck.

PagedAttention redesigns KV cache storage using a paging mechanism inspired by operating systems. Instead of one large allocation, the cache is broken into smaller fixed-size chunks called KV blocks.

Key ideas introduced by PagedAttention:

  • Non-contiguous storage
Logical sequence:
[Block 1][Block 2][Block 3][Block 4]
 
Physical GPU layout:
[Block 1]       [Block 3] [Block 4]     [Block 2]
    ^ mapping handled by page table ^
  • On-demand allocation
Initial:
[Block 1]
 
As sequence grows:
[Block 1][Block 2][Block 3] ... allocated only when needed
  • Near-zero waste
    Only the final incomplete block may have unused memory.

  • Page-table-style lookup

+-------------+-------------------------+
| Token Range | Physical Block Address  |
+-------------+-------------------------+
|   0 - 255   | Block_12                |
| 256 - 511   | Block_7                 |
| 512 - 767   | Block_19                |
+-------------+-------------------------+
  • Copy-on-write sharing for generation branches

During beam search or sampling, multiple candidates often share early history:

Before divergence:
Beam A → [B1][B2][B3]
Beam B → [B1][B2][B3]   (shared memory)
 
After divergence:
Beam A → [B1][B2][B3][B4]
Beam B → [B1][B2][B3][B4’]  (copy only when needed)
  • Higher throughput and scaling
    Since memory is used efficiently, more requests fit into GPU memory, increasing batch size and token throughput.
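Before moving to the implementation, here is a minimal sketch of the page-table-style lookup described above: a token's logical position is translated into a (physical block, offset) pair through the request's block table. The names and block size here are illustrative, not vLLM's actual API.

BLOCK_SIZE = 256

def lookup(token_index, block_table):
    # Translate a logical token index into (physical block, offset within block).
    logical_block = token_index // BLOCK_SIZE
    offset = token_index % BLOCK_SIZE
    return block_table[logical_block], offset

# Matches the table above: tokens 0-255 live in Block_12, 256-511 in Block_7, ...
block_table = ["Block_12", "Block_7", "Block_19"]
print(lookup(300, block_table))   # -> ('Block_7', 44)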

Implementation

Data Structures

KVCacheBlock

  • Represents a physical block in KV cache
  • Properties:
    • block_id: Unique identifier for the block
    • reference_count: Number of requests sharing this block
    • block_hash: SHA256 hash of block tokens for deduplication
import hashlib
from collections import deque
from typing import List, Optional
from pydantic import BaseModel

class KVCacheBlock(BaseModel):
    block_id: int
    reference_count: int = 0
    block_hash: Optional[str] = None

Request

  • Represents a user request with tokens and allocated blocks
  • Properties:
    • tokens: List of token IDs
    • block_size: Size of each block
    • block_table: List of physical block IDs allocated to this request
class Request(BaseModel):
    tokens: List[int]
    block_size: int
    block_table: List[int] = []

PagedAttention Class

Initialization

  • total_blocks: Maximum number of blocks available
  • block_size: Number of tokens per block
  • Maintains free blocks queue, physical block mappings, and hash table
def __init__(self, total_blocks, block_size):
    self.total_blocks = total_blocks
    self.block_size = block_size
    self.free_blocks = deque(range(total_blocks))
    self.physical_blocks = {}   # block_id -> KVCacheBlock
    self.hash_to_block = {}     # block_hash -> KVCacheBlock
    self.cache_hits = 0
    self.cache_misses = 0

Block Hashing (hash_block_tokens)

  • Creates SHA256 hash for token sequences
  • Incorporates parent hash for sequential block identification
  • Enables block-level deduplication
def hash_block_tokens(self, tokens, parent_hash):
    hasher = hashlib.sha256()
    if parent_hash:
        hasher.update(parent_hash.encode())
    for token in tokens:
        hasher.update(str(token).encode())
    return hasher.hexdigest()[:16]
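A quick way to see the chaining at work (assuming the snippets in this section are assembled into one PagedAttention class): identical blocks with identical ancestry hash to the same value, while the same tokens under a different prefix do not, which is exactly what makes prefix sharing safe.

pa = PagedAttention(total_blocks=8, block_size=4)

h1 = pa.hash_block_tokens([1, 2, 3, 4], None)
h2 = pa.hash_block_tokens([1, 2, 3, 4], None)
assert h1 == h2   # same block, same parent -> same hash, so it can be deduplicated

h3 = pa.hash_block_tokens([5, 6, 7, 8], h1)
h4 = pa.hash_block_tokens([5, 6, 7, 8], None)
assert h3 != h4   # same tokens but a different prefix -> different hash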

Block Allocation (allocate_block)

  • Allocates free blocks from the pool
  • Returns block_id if available, None if out of memory
def allocate_block(self):
    if not self.free_blocks:   # out of memory: no free blocks left
        return None
    block_id = self.free_blocks.popleft()
    return block_id

Block Deallocation (free_block)

  • Removes block from physical memory
  • Cleans up hash mappings
  • Returns block to free pool
def free_block(self, block_id: int):
    if block_id in self.physical_blocks:
        block = self.physical_blocks[block_id]
        if block.block_hash and block.block_hash in self.hash_to_block:
            del self.hash_to_block[block.block_hash]
        del self.physical_blocks[block_id]
    self.free_blocks.append(block_id)

Token Allocation (allocate)

  • Divides tokens into blocks (block_size tokens per block)
  • Implements cache hit logic: reuses existing blocks if hash matches
  • Implements cache miss logic: allocates new blocks for unique token sequences
  • Increments reference count for shared blocks
def allocate(self, tokens):
    num_tokens = len(tokens)
    num_blocks = (num_tokens + self.block_size - 1) // self.block_size
    request = Request(tokens=tokens, block_size=self.block_size)
    parent_hash = None
    
    for block_idx in range(num_blocks):
        start_idx = block_idx * self.block_size
        end_idx = start_idx + self.block_size
        block_tokens = tokens[start_idx:end_idx]
        block_hash = self.hash_block_tokens(block_tokens, parent_hash)
        
        if block_hash in self.hash_to_block:  # Cache hit
            cached_block = self.hash_to_block[block_hash]
            cached_block.reference_count += 1
            request.block_table.append(cached_block.block_id)
            self.cache_hits += 1
        else:  # Cache miss
            new_block_id = self.allocate_block()  # assumes a free block is available
            new_block = KVCacheBlock(block_id=new_block_id, reference_count=1, block_hash=block_hash)
            self.physical_blocks[new_block_id] = new_block
            self.hash_to_block[block_hash] = new_block
            request.block_table.append(new_block_id)
            self.cache_misses += 1
        parent_hash = block_hash
    
    return request
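As a sanity check of the hit/miss logic (again assuming the snippets form one class), two requests that share an 8-token prefix should reuse the first two blocks and only allocate fresh blocks where they diverge:

pa = PagedAttention(total_blocks=16, block_size=4)

req_a = pa.allocate([1, 2, 3, 4, 5, 6, 7, 8, 100, 101, 102, 103])
req_b = pa.allocate([1, 2, 3, 4, 5, 6, 7, 8, 200, 201, 202, 203])

print(req_a.block_table)               # [0, 1, 2]
print(req_b.block_table)               # [0, 1, 3]  -> first two blocks are shared
print(pa.cache_hits, pa.cache_misses)  # 2 4        -> hits only on the shared prefix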

Token Generation (generate_token)

  • Appends new token to request
  • Allocates new block when current block reaches capacity (token overflow)
  • Implements Copy-on-Write (CoW): creates new block if shared block has reference_count > 1
  • Writes in-place to exclusive blocks
def generate_token(self, request, new_token):
    request.tokens.append(new_token)
    last_block = self.physical_blocks[request.block_table[-1]]
    
    if len(request.tokens) % self.block_size == 1 and len(request.tokens) > 1:
        # Token overflow: allocate new block
        new_block_id = self.allocate_block()
        new_block = KVCacheBlock(block_id=new_block_id, reference_count=1)
        self.physical_blocks[new_block_id] = new_block
        request.block_table.append(new_block_id)
    
    elif last_block.reference_count > 1:
        # Copy-on-Write: create new block for exclusive modification
        new_block_id = self.allocate_block()
        last_block.reference_count -= 1
        new_block = KVCacheBlock(block_id=new_block_id, reference_count=1)
        self.physical_blocks[new_block_id] = new_block
        request.block_table[-1] = new_block_id
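To see copy-on-write trigger (same single-class assumption as above), allocate two identical requests so their last, partially filled block is shared, then generate a token on only one of them. Note that a real system would also copy the block's contents; this simulation only rewires the block table.

pa = PagedAttention(total_blocks=16, block_size=4)

req_a = pa.allocate([1, 2, 3, 4, 5, 6])   # blocks [0, 1]; block 1 holds [5, 6]
req_b = pa.allocate([1, 2, 3, 4, 5, 6])   # identical tokens -> shares blocks 0 and 1

pa.generate_token(req_b, 7)   # last block is shared (refcount 2) -> copy-on-write
print(req_a.block_table)      # [0, 1]    -> untouched
print(req_b.block_table)      # [0, 2]    -> got a private copy of the last block

pa.generate_token(req_b, 8)   # block 2 still has room, written in place
pa.generate_token(req_b, 9)   # crosses the block boundary -> fresh block appended
print(req_b.block_table)      # [0, 2, 3]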

Request Cleanup (free_request)

  • Decrements reference count for all blocks in request
  • Deallocates blocks when reference count reaches 0
def free_request(self, request):
    for physical_id in request.block_table:
        if physical_id in self.physical_blocks:
            block = self.physical_blocks[physical_id]
            block.reference_count -= 1
            if block.reference_count == 0:
                self.free_block(physical_id)

Statistics (print_stats)

  • Cache hit rate calculation
  • Free and allocated block counts
def print_stats(self):
    total = self.cache_hits + self.cache_misses
    hit_rate = (self.cache_hits / total * 100) if total > 0 else 0
    print(f"Hits: {self.cache_hits}")
    print(f"Misses: {self.cache_misses}")
    print(f"Hit Rate: {hit_rate:.1f}%")
    print(f"Free Blocks: {len(self.free_blocks)}/{self.total_blocks}")
    print(f"Allocated Blocks: {len(self.physical_blocks)}")

Conclusion

Implementation: GitHub

Overall, this was a fun project to build and a great way to dive deep into PagedAttention.