Simulating vLLM's PagedAttention

A simulation of PagedAttention, implemented in Python

Piyush Choudhari
6 min read
AIML
blog
PagedAttention
KVCache
vLLM

Why?

Anthropic recently released an article, Code execution with MCP: Building more efficient agents, which focuses heavily on token conservation and on how agents can use sandboxed environments to execute code and fetch results.

The idea of token/context management stuck with me, and I went down the rabbit hole of token management and optimization, which led me to the vLLM inference runtime. vLLM employs a ton of methods to optimize inference, but I found PagedAttention particularly interesting, so I decided to simulate it.

The Problems

LLM serving is now one of the most crucial components of modern applications, yet it still faces plenty of challenges around cost, performance, and resource management.

Common challenges:

  • High computational load
  • Poor GPU utilization
  • Latency issues
  • Inefficient memory management:
    1. Internal fragmentation
    2. Reservation waste
    3. External fragmentation
  • Complexity of distributed inference

The vLLM inference runtime aims to resolve these challenges, and one of its core components is PagedAttention.

PagedAttention

In LLMs, the KV cache plays an important role during inference. Because LLMs are autoregressive, tokens are generated one by one, and the model stores the key and value tensors from previous steps instead of recomputing them. This makes future token prediction faster, since the model can reuse past computations.
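To make the reuse concrete, here is a tiny, self-contained sketch of an autoregressive decode loop. Everything model-related is faked (fake_kv and the next-token rule are hypothetical stand-ins, not a real model API); the point is only that each step appends one new K/V entry to the cache instead of reprocessing the whole prefix.

def fake_kv(token_id):
    # Hypothetical stand-in for the real key/value projections of a layer.
    return (token_id * 2, token_id * 3)

def decode(prompt, steps):
    kv_cache = [fake_kv(t) for t in prompt]   # prefill: cache K/V for the prompt
    token = prompt[-1]
    for _ in range(steps):
        token = (token + 1) % 100             # dummy "model" picks the next token
        kv_cache.append(fake_kv(token))       # only the new token is processed;
                                              # earlier K/V entries are reused as-is
    return kv_cache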

Early implementations followed a memory allocation strategy borrowed from traditional deep learning workloads. Each inference request received a large, contiguous block of GPU memory reserved for its KV cache. The problem was that sequence lengths during inference are unpredictable. Some inputs end quickly, others expand due to long responses or continued user interaction. Since the system could not know the exact length ahead of time, it had to reserve more memory than necessary.

This led to several inefficiencies:

  • Most of the allocated memory remained unused, contributing to internal fragmentation.

  • Gaps appeared between allocated memory regions, creating external fragmentation.

  • Memory was often blocked for potential future tokens but never consumed.

Despite reserving large chunks of GPU memory, real usage often hovered between 20 and 40 percent. As a result, throughput suffered because the model could serve fewer simultaneous requests before exhausting memory. The hardware was capable of more, but the memory layout became the bottleneck.

PagedAttention redesigns KV cache storage using a paging mechanism inspired by operating systems. Instead of one large allocation, the cache is broken into smaller fixed-size chunks called KV blocks.

Key ideas introduced by PagedAttention:

  • Non-contiguous storage
Logical sequence:
[Block 1][Block 2][Block 3][Block 4]
 
Physical GPU layout:
[Block 1]       [Block 3] [Block 4]     [Block 2]
    ^ mapping handled by page table ^
  • On-demand allocation
Initial:
[Block 1]
 
As sequence grows:
[Block 1][Block 2][Block 3] ... allocated only when needed
  • Near-zero waste
    Only the final incomplete block may have unused memory.

  • Page-table-style lookup

+-------------+-------------------------+
| Token Range | Physical Block Address  |
+-------------+-------------------------+
|   0 - 255   | Block_12                |
| 256 - 511   | Block_7                 |
| 512 - 767   | Block_19                |
+-------------+-------------------------+
  • Copy-on-write sharing for generation branches

During beam search or sampling, multiple candidates often share early history:

Before divergence:
Beam A → [B1][B2][B3]
Beam B → [B1][B2][B3]   (shared memory)
 
After divergence:
Beam A → [B1][B2][B3][B4]
Beam B → [B1][B2][B3][B4’]  (copy only when needed)
  • Higher throughput and scaling
    Since memory is used efficiently, more requests fit into GPU memory, increasing batch size and token throughput.
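Before moving to the implementation, here is a minimal sketch of the page-table-style lookup described above: a token's logical position is translated into a (physical block, offset) pair through the request's block table. The names and block size here are illustrative, not vLLM's actual API.

BLOCK_SIZE = 256

def lookup(token_index, block_table):
    # Translate a logical token index into (physical block, offset within block).
    logical_block = token_index // BLOCK_SIZE
    offset = token_index % BLOCK_SIZE
    return block_table[logical_block], offset

# Matches the table above: tokens 0-255 live in Block_12, 256-511 in Block_7, ...
block_table = ["Block_12", "Block_7", "Block_19"]
print(lookup(300, block_table))   # -> ('Block_7', 44)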

Implementation

Data Structures

KVCacheBlock

  • Represents a physical block in KV cache
  • Properties:
    • block_id: Unique identifier for the block
    • reference_count: Number of requests sharing this block
    • block_hash: SHA256 hash of block tokens for deduplication
import hashlib
from collections import deque
from typing import List, Optional
from pydantic import BaseModel

class KVCacheBlock(BaseModel):
    block_id: int
    reference_count: int = 0
    block_hash: Optional[str] = None

Request

  • Represents a user request with tokens and allocated blocks
  • Properties:
    • tokens: List of token IDs
    • block_size: Size of each block
    • block_table: List of physical block IDs allocated to this request
class Request(BaseModel):
    tokens: List[int]
    block_size: int
    block_table: List[int] = []

PagedAttention Class

Initialization

  • total_blocks: Maximum number of blocks available
  • block_size: Number of tokens per block
  • Maintains free blocks queue, physical block mappings, and hash table
def __init__(self, total_blocks, block_size):
    self.total_blocks = total_blocks
    self.block_size = block_size
    self.free_blocks = deque(range(total_blocks))
    self.physical_blocks = {}   # block_id -> KVCacheBlock
    self.hash_to_block = {}     # block_hash -> KVCacheBlock
    self.cache_hits = 0
    self.cache_misses = 0

Block Hashing (hash_block_tokens)

  • Creates SHA256 hash for token sequences
  • Incorporates parent hash for sequential block identification
  • Enables block-level deduplication
def hash_block_tokens(self, tokens, parent_hash):
    hasher = hashlib.sha256()
    if parent_hash:
        hasher.update(parent_hash.encode())
    for token in tokens:
        hasher.update(str(token).encode())
    return hasher.hexdigest()[:16]
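A quick way to see the chaining at work (assuming the snippets in this section are assembled into one PagedAttention class): identical blocks with identical ancestry hash to the same value, while the same tokens under a different prefix do not, which is exactly what makes prefix sharing safe.

pa = PagedAttention(total_blocks=8, block_size=4)

h1 = pa.hash_block_tokens([1, 2, 3, 4], None)
h2 = pa.hash_block_tokens([1, 2, 3, 4], None)
assert h1 == h2   # same block, same parent -> same hash, so it can be deduplicated

h3 = pa.hash_block_tokens([5, 6, 7, 8], h1)
h4 = pa.hash_block_tokens([5, 6, 7, 8], None)
assert h3 != h4   # same tokens but a different prefix -> different hash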

Block Allocation (allocate_block)

  • Allocates free blocks from the pool
  • Returns block_id if available, None if out of memory
def allocate_block(self):
    if not self.free_blocks:   # out of memory: no free blocks left
        return None
    block_id = self.free_blocks.popleft()
    return block_id

Block Deallocation (free_block)

  • Removes block from physical memory
  • Cleans up hash mappings
  • Returns block to free pool
def free_block(self, block_id: int):
    if block_id in self.physical_blocks:
        block = self.physical_blocks[block_id]
        if block.block_hash and block.block_hash in self.hash_to_block:
            del self.hash_to_block[block.block_hash]
        del self.physical_blocks[block_id]
    self.free_blocks.append(block_id)

Token Allocation (allocate)

  • Divides tokens into blocks (block_size tokens per block)
  • Implements cache hit logic: reuses existing blocks if hash matches
  • Implements cache miss logic: allocates new blocks for unique token sequences
  • Increments reference count for shared blocks
def allocate(self, tokens):
    num_tokens = len(tokens)
    num_blocks = (num_tokens + self.block_size - 1) // self.block_size
    request = Request(tokens=tokens, block_size=self.block_size)
    parent_hash = None
    
    for block_idx in range(num_blocks):
        start_idx = block_idx * self.block_size
        end_idx = start_idx + self.block_size
        block_tokens = tokens[start_idx:end_idx]
        block_hash = self.hash_block_tokens(block_tokens, parent_hash)
        
        if block_hash in self.hash_to_block:  # Cache hit
            cached_block = self.hash_to_block[block_hash]
            cached_block.reference_count += 1
            request.block_table.append(cached_block.block_id)
            self.cache_hits += 1
        else:  # Cache miss
            new_block_id = self.allocate_block()  # assumes a free block is available
            new_block = KVCacheBlock(block_id=new_block_id, reference_count=1, block_hash=block_hash)
            self.physical_blocks[new_block_id] = new_block
            self.hash_to_block[block_hash] = new_block
            request.block_table.append(new_block_id)
            self.cache_misses += 1
        parent_hash = block_hash
    
    return request
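As a sanity check of the hit/miss logic (again assuming the snippets form one class), two requests that share an 8-token prefix should reuse the first two blocks and only allocate fresh blocks where they diverge:

pa = PagedAttention(total_blocks=16, block_size=4)

req_a = pa.allocate([1, 2, 3, 4, 5, 6, 7, 8, 100, 101, 102, 103])
req_b = pa.allocate([1, 2, 3, 4, 5, 6, 7, 8, 200, 201, 202, 203])

print(req_a.block_table)               # [0, 1, 2]
print(req_b.block_table)               # [0, 1, 3]  -> first two blocks are shared
print(pa.cache_hits, pa.cache_misses)  # 2 4        -> hits only on the shared prefix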

Token Generation (generate_token)

  • Appends new token to request
  • Allocates new block when current block reaches capacity (token overflow)
  • Implements Copy-on-Write (CoW): creates new block if shared block has reference_count > 1
  • Writes in-place to exclusive blocks
def generate_token(self, request, new_token):
    request.tokens.append(new_token)
    last_block = self.physical_blocks[request.block_table[-1]]
    
    if len(request.tokens) % self.block_size == 1 and len(request.tokens) > 1:
        # Token overflow: allocate new block
        new_block_id = self.allocate_block()
        new_block = KVCacheBlock(block_id=new_block_id, reference_count=1)
        self.physical_blocks[new_block_id] = new_block
        request.block_table.append(new_block_id)
    
    elif last_block.reference_count > 1:
        # Copy-on-Write: create new block for exclusive modification
        new_block_id = self.allocate_block()
        last_block.reference_count -= 1
        new_block = KVCacheBlock(block_id=new_block_id, reference_count=1)
        self.physical_blocks[new_block_id] = new_block
        request.block_table[-1] = new_block_id
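To see copy-on-write trigger (same single-class assumption as above), allocate two identical requests so their last, partially filled block is shared, then generate a token on only one of them. Note that a real system would also copy the block's contents; this simulation only rewires the block table.

pa = PagedAttention(total_blocks=16, block_size=4)

req_a = pa.allocate([1, 2, 3, 4, 5, 6])   # blocks [0, 1]; block 1 holds [5, 6]
req_b = pa.allocate([1, 2, 3, 4, 5, 6])   # identical tokens -> shares blocks 0 and 1

pa.generate_token(req_b, 7)   # last block is shared (refcount 2) -> copy-on-write
print(req_a.block_table)      # [0, 1]    -> untouched
print(req_b.block_table)      # [0, 2]    -> got a private copy of the last block

pa.generate_token(req_b, 8)   # block 2 still has room, written in place
pa.generate_token(req_b, 9)   # crosses the block boundary -> fresh block appended
print(req_b.block_table)      # [0, 2, 3]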

Request Cleanup (free_request)

  • Decrements reference count for all blocks in request
  • Deallocates blocks when reference count reaches 0
def free_request(self, request):
    for physical_id in request.block_table:
        if physical_id in self.physical_blocks:
            block = self.physical_blocks[physical_id]
            block.reference_count -= 1
            if block.reference_count == 0:
                self.free_block(physical_id)

Statistics (print_stats)

  • Cache hit rate calculation
  • Free and allocated block counts
def print_stats(self):
    total = self.cache_hits + self.cache_misses
    hit_rate = (self.cache_hits / total * 100) if total > 0 else 0
    print(f"Hits: {self.cache_hits}")
    print(f"Misses: {self.cache_misses}")
    print(f"Hit Rate: {hit_rate:.1f}%")
    print(f"Free Blocks: {len(self.free_blocks)}/{self.total_blocks}")
    print(f"Allocated Blocks: {len(self.physical_blocks)}")

Conclusion

Implementation: GitHub

Overall, this was a fun project to build and a great way to dive deep into PagedAttention.