Virtual Memory & Paging - Why Your Computer Doesn't Crash When Running 100 Programs
The Lie Your Computer Tells Every Program
Here's something that bothered me for years when I was learning systems programming. I'd write a simple C program, compile it, and when I'd print the address of a variable, I'd get something like 0x00007ffd12345678. Then I'd run a completely different program, and it would also have variables at addresses starting with 0x00007ffd....
Wait, what? How can two different programs have variables at the same address? Aren't they fighting over the same memory location? Why doesn't everything explode?
The answer is one of the most beautiful illusions in computer science: virtual memory. Your OS is lying to every single program. Each program thinks it owns all the memory from address 0 to some huge number (like 2^48 bytes on 64-bit systems - that's 256 terabytes!). But in reality, they're all sharing the same physical RAM chips, and the hardware + OS are doing an incredibly sophisticated magic trick to make it work.
Understanding this took me weeks - reading OS textbooks, digging through Linux kernel code, and eventually writing my own simple paging implementation. Once it finally clicked, I realized it's one of the most elegant hacks in computing.
In this article, we'll explore what happens when your CPU tries to read memory, how the Memory Management Unit (MMU) translates addresses, what page tables actually look like in RAM, and why page faults aren't always errors. I'll build this up from first principles, because that's the only way it really makes sense.
Let's untangle virtual memory.
The Problem: Multiple Programs, One RAM
Let's start with the fundamental problem. Your computer has, say, 16 GB of physical RAM. But if you look at Task Manager or htop, you might see 50+ processes running, each with its own memory. How does that work?
The Naive Approach (That Would Be Terrible)
Imagine if programs directly accessed physical memory addresses:
Physical RAM (16 GB):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Address What's There?
─────── ─────────────
0x00000000 ┐
│ Operating System Kernel
0x01000000 ┘
┐
│ Firefox (loaded at 0x01000000)
0x05000000 ┘
┐
│ Your Program (loaded at 0x05000000)
0x05100000 ┘
┐
│ Chrome (loaded at 0x05100000)
0x09000000 ┘
│
│ Free space...
│
0x400000000 ┘ (End of 16 GB)
This creates massive problems:
- Memory conflicts: Program A wants to load at address 0x10000000. Program B also wants to load at 0x10000000. They can't both have it!
- Security nightmare: Program A could read or write Program B's memory directly. Want to steal Chrome's password? Just read address 0x05100000. Game over.
- Fragmentation: If you close Firefox, you're left with a 64 MB hole from 0x01000000 to 0x05000000. A program that needs a bigger contiguous block can't use it, even if the total free RAM, scattered across holes like this one, would be enough.
- No isolation: One buggy program with a wild pointer could corrupt the kernel, crash your entire system, or worse.
- Fixed addresses: Programs would need to be compiled for specific memory addresses. Move a program to a different address? Recompile everything. Nightmare.
This is how computers worked in the 1970s. It was... not great.
Enter Virtual Memory: The Beautiful Illusion
Virtual memory solves all these problems with one clever idea: give every process its own private address space that's completely separate from physical RAM.
The Illusion: Each Process Gets Its Own "World"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Process A's View: Process B's View:
┌──────────────────┐ ┌──────────────────┐
│ 0xFFFFFFFFFFFFFFFF│ │ 0xFFFFFFFFFFFFFFFF│
│ Kernel │ │ Kernel │
│ 0xFFFF800000000000│ │ 0xFFFF800000000000│
├──────────────────┤ ├──────────────────┤
│ │ │ │
│ (unmapped) │ │ (unmapped) │
│ │ │ │
├──────────────────┤ ├──────────────────┤
│ 0x7FFF00000000 │ │ 0x7FFF00000000 │
│ Stack │ │ Stack │
├──────────────────┤ ├──────────────────┤
│ Libraries │ │ Libraries │
├──────────────────┤ ├──────────────────┤
│ Heap │ │ Heap │
├──────────────────┤ ├──────────────────┤
│ Data/BSS │ │ Data/BSS │
├──────────────────┤ ├──────────────────┤
│ Code (.text) │ │ Code (.text) │
│ 0x0000000000400000│ │ 0x0000000000400000│
└──────────────────┘ └──────────────────┘
Both processes think they own addresses
starting at 0x400000, but they're actually
in DIFFERENT physical RAM locations!
Every process gets what looks like a massive address space (on 64-bit Linux, it's 128 TB of user space!). But here's the trick: most of those addresses aren't actually backed by real RAM. They're just numbers. Only the parts your program actually uses get mapped to physical memory.
Three Types of Addresses
This is where it gets important to be precise. There are actually three different types of addresses in a modern system:
- Virtual Addresses (VA): What your program sees and uses. These are the addresses in your C pointers, in your debugger, everywhere in user space.
- Physical Addresses (PA): The actual addresses on the memory bus that go to the RAM chips. These are what the hardware uses.
- Bus Addresses: The addresses devices use for DMA. They can differ from physical addresses, for example when an IOMMU translates device accesses. (We'll mostly ignore these.)
When your program does this:
int x = 42;
printf("Address of x: %p\n", (void*)&x);
That address it prints? That's a virtual address. It's not the real location in RAM. It's an address in your process's private, imaginary address space.
The Hardware: Memory Management Unit (MMU)
So how does the CPU convert virtual addresses to physical addresses? Meet the MMU (Memory Management Unit) - a piece of hardware built into every modern desktop, laptop, phone, and server CPU.
CPU Memory Access With MMU:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Your Program:
int* ptr = (int*)0x00007ffd12345000; // Virtual address
int value = *ptr; // Read from memory
↓
CPU Core:
┌─────────────────────────────────┐
│ Load from 0x00007ffd12345000 │
└────────────┬────────────────────┘
│ Virtual Address (VA)
▼
┌─────────────────────────────────┐
│ Memory Management Unit (MMU) │
│ │
│ 1. Check TLB cache │
│ 2. If miss, walk page tables │
│ 3. Translate VA → PA │
│ 4. Check permissions │
│ │
└────────────┬────────────────────┘
│ Physical Address (PA)
▼
0x0000000089abc000 ← Real RAM location!
│
▼
┌─────────────────────────────────┐
│ Memory Bus │
└────────────┬────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Physical RAM Chips │
│ [Actual memory data] │
└─────────────────────────────────┘
│
└──► Value returned to CPU
The MMU sits between your CPU core and the memory bus. Every single memory access goes through it. Every load, every store, every instruction fetch - all translated from virtual to physical addresses.
This happens in hardware, which is crucial. If it were done in software, every memory access would need OS intervention and your computer would be orders of magnitude slower. In hardware, and with the help of a small translation cache we'll meet shortly, the translation is nearly free in the common case.
Pages: Memory in Fixed-Size Chunks
The MMU doesn't translate every single byte address individually - that would require an insane amount of bookkeeping. Instead, memory is divided into fixed-size blocks called pages.
On most systems:
- Page size: 4 KB (4096 bytes)
- Large pages (optional): 2 MB or 1 GB
Virtual Address Space Divided Into Pages:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Virtual Address: 0x00007ffd12345678
│
▼
Split into two parts:
┌────────────────────────┬─────────────┐
│ Virtual Page Number │ Offset │
│ (VPN) │ │
│ 0x00007ffd12345 │ 0x678 │
└────────────────────────┴─────────────┘
36 bits on x86-64 (52 in a full 64-bit VA)      12 bits
▲ ▲
│ │
│ └─ Offset within the page
│ (0-4095, since 2^12 = 4096)
│
└─ Which page? (the page number)
One page = 4096 bytes = 4 KB
Offset 0x678 = 1656 bytes into the page
The MMU translates page numbers, not individual byte addresses. The offset within a page stays the same in both virtual and physical addresses.
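To make the split concrete, here is a minimal C sketch assuming 4 KB pages. The helper names (va_to_vpn, va_to_offset, make_addr) are hypothetical, made up for illustration; the point is that the whole split is one shift and one mask.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers (hypothetical names), assuming 4 KB pages:
   the low 12 bits are the offset, everything above is the page number. */
#define PAGE_BITS   12
#define PAGE_BYTES  (1ULL << PAGE_BITS)
#define OFFSET_MASK (PAGE_BYTES - 1)

/* Virtual page number: everything above the offset bits. */
static inline uint64_t va_to_vpn(uint64_t va) { return va >> PAGE_BITS; }

/* Offset within the page: the low 12 bits, copied unchanged into the PA. */
static inline uint64_t va_to_offset(uint64_t va) { return va & OFFSET_MASK; }

/* Glue a page number and an offset back into a full address. */
static inline uint64_t make_addr(uint64_t page_number, uint64_t offset) {
    assert(offset < PAGE_BYTES);
    return (page_number << PAGE_BITS) | offset;
}
```

Because the offset travels through untouched, translation only ever has to look up the page number: make_addr(0x89abc, 0x678) yields 0x89abc678.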
Why 4 KB Pages?
This isn't arbitrary. It's a trade-off:
- Too small (e.g., 512 bytes): Too many pages, page tables become huge, more overhead
- Too large (e.g., 1 MB): Internal fragmentation (waste), less flexible memory management
4 KB has been the standard x86 page size since the 80386 introduced paging in 1985, and it is still the default on most systems. Modern CPUs also support huge pages (2 MB or 1 GB on x86-64) for performance in special cases like databases or VMs.
Page Tables: The Translation Lookup Structure
So how does the MMU know which virtual page maps to which physical page? There's a data structure in RAM called a page table that stores these mappings.
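Page tables are stored in RAM. Yes, the same RAM you're trying to access. That sounds like a chicken-and-egg problem, but the hardware sidesteps it: the CR3 register and every pointer inside the page tables hold physical addresses, so the MMU can walk the tables without translating anything first.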
Simple (Naive) Page Table
The simplest page table is just an array:
Simple Page Table (Conceptual):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Virtual Page Number → Physical Page Number + Flags
─────────────────── ──────────────────────────────
VPN 0x00000 → PPN 0x12345 [Present, R/W]
VPN 0x00001 → PPN 0x98765 [Present, R/W]
VPN 0x00002 → (not mapped) [Not Present]
VPN 0x00003 → PPN 0x3A2B1 [Present, R-only]
...
VPN 0x7FFFFFFFF → (end of user space)
Each entry (Page Table Entry) contains:
- Physical Page Number (PPN)
- Permission bits (Read, Write, Execute)
- Present bit (is this page in RAM?)
- Dirty bit (has it been modified?)
- Accessed bit (has it been read or written?)
To translate an address, the MMU:
- Extracts the Virtual Page Number from the virtual address
- Uses the VPN as an index into the page table: page_table[VPN]
- Reads the Physical Page Number from the entry
- Combines the PPN with the offset to get the physical address
Address Translation Process:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Virtual Address: 0x00007ffd12345678
Step 1: Split the address
VPN: 0x00007ffd12345 (page number)
Offset: 0x678 (offset within page)
Step 2: Look up in page table
page_table[0x00007ffd12345] → PPN 0x89abc
Step 3: Combine PPN + Offset
Physical Address: 0x89abc000 + 0x678
= 0x89abc678
Step 4: Send to memory bus
Memory bus sees: 0x89abc678
RAM returns the data at that location
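The steps above fit in a few lines of C. This is a toy model, not real hardware: the entry format (present bit, writable bit, PPN in bits 12 and up), the 16-page address space, and the names map_page and translate are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical entry format for this sketch (NOT the real x86 layout):
   bit 0 = present, bit 1 = writable, bits 12 and up = physical page number. */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITABLE (1ULL << 1)
#define PAGE_BITS    12
#define OFFSET_MASK  ((1ULL << PAGE_BITS) - 1)
#define NUM_VPAGES   16                   /* a toy 16-page address space */

typedef uint64_t pte_t;

/* The flat ("naive") page table: one entry per virtual page,
   all zero (not present) until mapped. */
static pte_t page_table[NUM_VPAGES];

static void map_page(uint64_t vpn, uint64_t ppn, bool writable) {
    assert(vpn < NUM_VPAGES);
    page_table[vpn] = (ppn << PAGE_BITS) | PTE_PRESENT
                    | (writable ? PTE_WRITABLE : 0);
}

/* On success stores the physical address in *pa and returns true;
   returns false where real hardware would raise a page fault. */
static bool translate(uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_BITS;          /* Step 1: split the address */
    uint64_t offset = va & OFFSET_MASK;
    if (vpn >= NUM_VPAGES) return false;     /* outside the toy space */
    pte_t pte = page_table[vpn];             /* Step 2: index the table */
    if (!(pte & PTE_PRESENT)) return false;  /* not mapped: page fault */
    uint64_t ppn = pte >> PAGE_BITS;         /* drop the flag bits */
    *pa = (ppn << PAGE_BITS) | offset;       /* Step 3: PPN + offset */
    return true;
}
```

After map_page(3, 0x89abc, true), translating 0x3678 gives 0x89abc678, while any unmapped page returns false. The table is a plain array indexed by VPN, which is exactly why it can't scale to a real 64-bit address space.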
The Problem: Page Tables Would Be HUGE
On a 64-bit system with 4 KB pages, you'd need:
- 2^52 possible virtual pages (in a 64-bit address space)
- Each page table entry: 8 bytes
- Total page table size: 2^52 × 8 = 32 Petabytes
That's insane. Your OS would need 32 PB of RAM just to store the page table for ONE process! Clearly, the simple approach doesn't work.
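Even counting only the 48 address bits x86-64 actually uses (which leaves 36 bits of page number), a flat table is still far too big:

```latex
\begin{aligned}
\text{64-bit VA:}\quad & 2^{52}\ \text{entries} \times 8\ \text{B} = 2^{55}\ \text{B} = 32\ \text{PB} \\
\text{48-bit VA:}\quad & 2^{36}\ \text{entries} \times 8\ \text{B} = 2^{39}\ \text{B} = 512\ \text{GB}
\end{aligned}
```

Half a terabyte per process, almost all of it describing memory the process never touches.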
Multi-Level Page Tables: The Real Solution
The solution is to use a hierarchical page table structure - a tree where you only allocate the parts you actually need.
On x86-64 Linux, there are normally 4 levels of page tables (newer CPUs can enable a fifth level for 57-bit virtual addresses):
Four-Level Page Table Structure (x86-64):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Virtual Address (48 bits used):
┌─────────┬─────────┬─────────┬─────────┬─────────────┐
│ PML4 │ PDPT │ PD │ PT │ Offset │
│ (9 bits)│ (9 bits)│ (9 bits)│ (9 bits)│ (12 bits) │
└─────────┴─────────┴─────────┴─────────┴─────────────┘
│ │ │ │ │
│ │ │ │ └─► Offset in page (0-4095)
│ │ │ └──────────────► Page Table index
│ │ └────────────────────────► Page Directory index
│ └──────────────────────────────────► PDPT index
└────────────────────────────────────────────► PML4 index
The Walk:
┌──────────────────────────────────────────────────────┐
│ CR3 Register (in CPU) points to PML4 table │
└──────────────┬───────────────────────────────────────┘
│
▼
┌───────────────┐
│ PML4 Table │ 512 entries (one per PML4 index)
│ [-------] │ Most are empty (not allocated)!
│ [ PTR ]───┐│
└───────────────┘│
│
▼
┌───────────────┐
│ PDPT Table │ 512 entries
│ [-------] │ Again, most empty
│ [ PTR ]───┐│
└───────────────┘│
│
▼
┌───────────────┐
│ PD Table │ 512 entries
│ [-------] │
│ [ PTR ]───┐│
└───────────────┘│
│
▼
┌───────────────┐
│ PT Table │ 512 entries
│ [-------] │
│ [0x89abc] │← Physical Page Number!
└───────────────┘
Why this is brilliant:
- Sparse allocation: You only allocate tables for memory you're actually using. If a process uses 10 MB, you might only need a few page tables, not 32 PB!
- Lazy allocation: Page tables can be created on demand when memory is first accessed.
- Each table is exactly one page: Each level's table is 512 entries × 8 bytes = 4096 bytes = 1 page. This makes them easy to manage.
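To put a number on "a few page tables", take the 10 MB example and assume it is a single contiguous region, aligned to a 2 MB boundary and not crossing a 1 GB boundary:

```latex
\begin{aligned}
\frac{10\ \text{MB}}{4\ \text{KB}} &= 2560\ \text{PTEs}, \qquad \left\lceil \frac{2560}{512} \right\rceil = 5\ \text{PT tables} \\
(5\ \text{PT} + 1\ \text{PD} + 1\ \text{PDPT} + 1\ \text{PML4}) \times 4\ \text{KB} &= 32\ \text{KB}
\end{aligned}
```

32 KB of tables instead of 32 PB. Real processes have several scattered regions (code, heap, stack, libraries), each needing its own chain of lower-level tables, but the total stays tiny.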
Walking the Page Table
When the MMU translates an address, it performs a page table walk:
Page Table Walk Example:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Virtual Address: 0x00007ffd12345678
1. Extract indices:
PML4 = bits 39-47 = 0x0FF
PDPT = bits 30-38 = 0x1F4
PD = bits 21-29 = 0x091
PT = bits 12-20 = 0x145
Offset = bits 0-11 = 0x678
2. Read CR3 register → PML4 base address
Let's say CR3 = 0x001000000
3. PML4 lookup:
addr = 0x001000000 + (0x0FF × 8) = 0x0010007F8
Read 8 bytes → get PDPT address: 0x002000000
4. PDPT lookup:
addr = 0x002000000 + (0x1F4 × 8) = 0x002000FA0
Read 8 bytes → get PD address: 0x003000000
5. PD lookup:
addr = 0x003000000 + (0x091 × 8) = 0x003000488
Read 8 bytes → get PT address: 0x004000000
6. PT lookup:
addr = 0x004000000 + (0x145 × 8) = 0x004000A28
Read 8 bytes → get PPN: 0x89abc (+ flags)
7. Construct physical address:
PA = (0x89abc << 12) | 0x678 = 0x89abc678
8. Access memory at PA 0x89abc678!
Notice something? To access one byte of memory, we had to do four memory reads just to walk the page table! This would be horribly slow... which is why we need the TLB.
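Here is a toy C model of that walk. It assumes tables live in ordinary heap memory and that each entry stores a plain pointer to the next table (or, at the last level, a PPN shifted left by 12) with bit 0 as the present bit; real entries hold physical addresses and many more flags. The names walk, map, and translate are hypothetical. The index extraction, though, is the same bit slicing shown above: shift by 12 + 9 × level, keep 9 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy 4-level walk (assumption: heap-allocated tables, pointer entries,
   bit 0 = present). Real x86-64 entries hold physical addresses. */
#define ENTRIES 512
#define PRESENT 1ULL

typedef struct { uint64_t entry[ENTRIES]; } table_t;  /* 512 x 8 B = 4 KB */

/* level 3 = PML4 (bits 47-39), 2 = PDPT, 1 = PD, 0 = PT (bits 20-12). */
static unsigned level_index(uint64_t va, int level) {
    assert(level >= 0 && level <= 3);
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

/* Follow the path for va from the PML4 down to the PT. If create is
   nonzero, allocate missing tables on the way (as the OS would);
   otherwise return NULL at the first hole. Returns the PT entry's slot. */
static uint64_t *walk(table_t *pml4, uint64_t va, int create) {
    table_t *t = pml4;
    for (int level = 3; level > 0; level--) {
        uint64_t *e = &t->entry[level_index(va, level)];
        if (!(*e & PRESENT)) {                 /* one memory read per level */
            if (!create) return NULL;
            table_t *next = calloc(1, sizeof *next);
            if (!next) return NULL;
            *e = (uint64_t)(uintptr_t)next | PRESENT;
        }
        t = (table_t *)(uintptr_t)(*e & ~PRESENT);
    }
    return &t->entry[level_index(va, 0)];
}

/* Map the page containing va to physical page ppn
   (tables are never freed in this sketch). */
static int map(table_t *pml4, uint64_t va, uint64_t ppn) {
    uint64_t *pte = walk(pml4, va, 1);
    if (!pte) return -1;
    *pte = (ppn << 12) | PRESENT;
    return 0;
}

/* Returns 0 and stores the physical address, or -1 where hardware
   would raise a page fault. */
static int translate(table_t *pml4, uint64_t va, uint64_t *pa) {
    uint64_t *pte = walk(pml4, va, 0);
    if (!pte || !(*pte & PRESENT)) return -1;  /* fourth read: the PTE */
    *pa = ((*pte >> 12) << 12) | (va & 0xFFF);
    return 0;
}
```

translate() performs one table read per level, four in all, before it can even start the real memory access. That repeated cost is what the next section's cache exists to avoid.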
The TLB: Translation Lookaside Buffer (The Speed Saver)
The TLB (Translation Lookaside Buffer) is a tiny, fast cache inside the CPU that stores recent virtual-to-physical address translations.
TLB Cache Structure:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TLB Entry:
┌──────────────┬──────────────┬─────────┬────────┐
│ VPN │ PPN │ Flags │ ASID │
├──────────────┼──────────────┼─────────┼────────┤
│ 0x7ffd12345 │ 0x89abc │ R/W │ 12 │
│ 0x7ffd12346 │ 0x89abd │ R/W │ 12 │
│ 0x400001 │ 0x23456 │ R/X │ 12 │
│ 0x7fffffff │ 0x12345 │ R/W │ 12 │
└──────────────┴──────────────┴─────────┴────────┘
▲ ▲ ▲ ▲
│ │ │ │
Virtual Physical Permission Address-space ID
Page Number Page Number bits (which process)
Typical sizes:
- L1 TLB: 64-128 entries (data) + 128 entries (instruction)
- L2 TLB: roughly 512 to a few thousand entries, depending on the CPU
Hit rate: Usually 95-99% in normal workloads!
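One way to read these sizes is TLB reach: the amount of memory the TLB can map at once. Taking 1024 entries as an example, and assuming every entry maps a page of the same size:

```latex
\begin{aligned}
1024 \times 4\ \text{KB} &= 4\ \text{MB} \\
1024 \times 2\ \text{MB} &= 2\ \text{GB}
\end{aligned}
```

A program whose working set is much larger than a few megabytes can't keep all its 4 KB translations cached, which is where huge pages help.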
Memory access with TLB:
Fast Path (TLB Hit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. CPU wants to access VA 0x7ffd12345678
2. MMU extracts VPN: 0x7ffd12345
3. Check TLB: Is VPN 0x7ffd12345 cached?
4. YES! TLB has: VPN 0x7ffd12345 → PPN 0x89abc
5. Construct PA: 0x89abc678
6. Access memory directly
Translation overhead: ~1-2 CPU cycles
No page table walk needed!
Slow Path (TLB Miss):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. CPU wants to access VA 0x12345000
2. MMU extracts VPN: 0x12345
3. Check TLB: Is VPN 0x12345 cached?
4. NO! TLB miss!
5. Perform 4-level page table walk (4 memory reads)
6. Find PPN: 0xABCDE
7. Add translation to TLB (evict old entry if full)
8. Construct PA: 0xABCDE000
9. Access memory
Translation overhead: tens to hundreds of CPU cycles, depending on whether the page table entries are already in the data caches (much slower)
But now it's in TLB for next time!
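The TLB hit rate is critical for performance. If your program accesses memory randomly across many pages, you'll have lots of TLB misses and everything slows down. Sequential access is the opposite: with 4 KB pages, a loop over 4-byte ints touches the same page 1024 times before moving on, so it pays for at most one TLB miss per 1024 accesses. This is one reason cache-friendly programming (accessing memory sequentially) matters!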
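A toy simulation makes the difference visible. This sketch rests on simplifying assumptions: a single fully associative 64-entry TLB with FIFO replacement (real TLBs are set-associative, multi-level, and smarter about eviction), and the names tlb_t and tlb_access are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy TLB (assumed size and policy): one level, fully associative,
   FIFO replacement. Real TLBs are set-associative and multi-level. */
#define TLB_ENTRIES 64
#define PAGE_BITS   12

typedef struct {
    uint64_t vpn[TLB_ENTRIES];
    bool valid[TLB_ENTRIES];
    unsigned next_victim;            /* FIFO: slot to overwrite next */
    unsigned long hits, misses;
} tlb_t;

/* Look up the page containing va. On a miss, pretend we walked the page
   tables and cache the translation, evicting the oldest entry. */
static void tlb_access(tlb_t *tlb, uint64_t va) {
    uint64_t vpn = va >> PAGE_BITS;
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb->valid[i] && tlb->vpn[i] == vpn) {
            tlb->hits++;
            return;
        }
    }
    tlb->misses++;
    assert(tlb->next_victim < TLB_ENTRIES);
    tlb->vpn[tlb->next_victim] = vpn;
    tlb->valid[tlb->next_victim] = true;
    tlb->next_victim = (tlb->next_victim + 1) % TLB_ENTRIES;
}
```

Feeding it the addresses of a sequential loop over 1,000,000 ints (4 MB) produces one miss per page, 977 in total. Cycling through 128 pages with one access each makes every single access miss, because FIFO always evicts exactly the page that's about to be needed again.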
Context Switches and TLB Flushes
When the OS switches from Process A to Process B, Process B must not use Process A's cached translations, so the TLB has to be invalidated (flushed).
To soften this, modern CPUs can tag TLB entries with an address-space identifier (ASID on ARM, PCID on x86), so entries from different processes can coexist and a context switch doesn't have to throw everything away. Without such tags, every switch flushes the TLB (except entries marked global, typically kernel pages), which temporarily hurts performance until the TLB warms up again.
Page Faults: When Memory Isn't There
Here's where things get really interesting. What happens if you access a virtual address that isn't currently mapped to physical RAM?
A page fault occurs. And contrary to what the name suggests, it's not always an error!
Page Fault Types:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
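1. Minor Page Fault (Soft):
The page's data is already in RAM (e.g., in the page cache
or shared with another process), or the page can be created
without I/O (a fresh zero-filled page)
→ OS just maps it, no disk I/O needed
→ Fast (~microseconds)
2. Major Page Fault (Hard):
The page has to be read from disk (it was swapped out, or
it's a file-backed page not yet in the page cache)
→ OS must do disk I/O before mapping it
→ Slow (tens of microseconds on an SSD, milliseconds on a hard disk)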
3. Invalid Page Fault (Segmentation Fault):
Access to unmapped memory that shouldn't be accessed
→ OS kills the process (SIGSEGV)
→ "Segmentation fault (core dumped)"
The Page Fault Handler
When a page fault happens, the CPU:
- Stops executing your program
- Saves the faulting address in a special register (CR2 on x86)
- Jumps to the OS kernel's page fault handler
- The OS decides what to do
Page Fault Flow:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
User Program:
int* ptr = (int*)0x7ffd12345000;
*ptr = 42; // Trigger page fault!
│
▼
CPU detects: Page not present
│
▼
┌──────────────────────────────┐
│ HARDWARE (Automatic) │
│ │
│ 1. Save address to CR2 │
│ 2. Save current state │
│ 3. Jump to page fault handler│
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ KERNEL (Software) │
│ │
│ Read CR2 register │
│ Determine fault type │
│ │
│ Is address valid? │
│ ├─ NO → Kill process (SIGSEGV)
│ └─ YES: │
│ Is page swapped? │
│ ├─ YES → Load from disk │
│ └─ NO → Allocate new │
│ │
│ Update page table │
│ Flush TLB if needed │
└──────────────────────────────┘
│
▼
Return to user program
Retry the instruction
This time it works!
Demand Paging: Lazy Allocation
Here's a crazy optimization: when you allocate memory (e.g., malloc(1000000) for 1 MB), the OS doesn't actually give you physical RAM immediately!
Demand Paging in Action:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
char* buf = malloc(1000000); // Request 1 MB
What actually happens:
1. OS reserves a range of about 245 virtual pages
(1,000,000 bytes ÷ 4096 bytes/page ≈ 244.1, rounded up)
2. OS does NOT allocate physical RAM yet!
3. No page table entries are filled in yet: "Present = 0"
4. malloc() returns immediately (fast!)
Now you write to the memory:
buf[0] = 'A'; // First write
5. Page fault! (Present bit = 0)
6. OS allocates ONE physical page (4 KB)
7. OS updates page table: map virtual → physical
8. Instruction retries, write succeeds
buf[100000] = 'B'; // Write to different page
9. Page fault again!
10. OS allocates another physical page
11. Updates page table
12. Write succeeds
Result: Only the pages you actually touch
get physical RAM allocated!
This is why a large malloc() like this one is so fast - it's not actually allocating physical memory, just reserving address space! Physical pages are allocated on demand when you first access them.
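You can watch demand paging happen on Linux. The sketch below is a demo under stated assumptions: it calls mmap() directly instead of malloc() (so allocator details don't interfere), reads the ru_minflt counter from getrusage(), and demand_paging_demo is a hypothetical helper name. Expect the reserve phase to report few or no faults and the touch phase roughly one fault per page, though transparent huge pages or other kernel features can change the exact count.

```c
#define _GNU_SOURCE            /* MAP_ANONYMOUS even with strict -std= flags */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults taken so far by this process (Linux maintains ru_minflt). */
static long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}

/* Hypothetical demo helper: reserve npages of anonymous memory, then
   write one byte per page. Reports the minor-fault delta of each phase;
   returns 0 on success. */
static int demand_paging_demo(size_t npages, long *reserve_faults,
                              long *touch_faults) {
    assert(npages > 0);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    size_t len = npages * (size_t)page;

    long before = minor_faults();
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return -1;
    long reserved = minor_faults();      /* address space only, no RAM yet */

    for (size_t i = 0; i < len; i += (size_t)page)
        ((volatile char *)buf)[i] = 1;   /* first write to each page faults */
    long touched = minor_faults();

    munmap(buf, len);
    *reserve_faults = reserved - before;
    *touch_faults = touched - reserved;
    return 0;
}
```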
Copy-On-Write: The fork() Magic
Here's one of the coolest uses of virtual memory: copy-on-write (COW).
When you call fork() to create a child process, the OS doesn't copy all the memory:
Fork With Copy-On-Write:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Parent Process:
Virtual Memory Physical Memory
┌──────────┐ ┌──────────┐
│ Page 0 │────────► │ Frame A │
│ Page 1 │────────► │ Frame B │
│ Page 2 │────────► │ Frame C │
└──────────┘ └──────────┘
R/W permissions
fork() happens!
Parent Process: Child Process:
Virtual Memory Virtual Memory
┌──────────┐ ┌──────────┐
│ Page 0 │──┐ ┌──│ Page 0 │
│ Page 1 │──┤ ├──│ Page 1 │
│ Page 2 │──┤ ├──│ Page 2 │
└──────────┘ │ │ └──────────┘
│ │
▼ ▼
┌──────────┐
│ Frame A │ ← SHARED!
│ Frame B │ ← Both point to same RAM
│ Frame C │
└──────────┘
Both page tables marked: Read-Only (R/O)
Now parent writes to Page 1:
page1[0] = 42;
Page fault! (write to read-only page)
OS does:
1. Allocate NEW physical frame (Frame D)
2. Copy Frame B → Frame D
3. Update parent's page table: Page 1 → Frame D (R/W)
4. Child still points to Frame B
5. Retry write, succeeds
Parent: Child:
┌──────────┐ ┌──────────┐
│ Page 1 │─┐ │ Page 1 │─┐
└──────────┘ │ └──────────┘ │
│ │
▼ ▼
┌────────┐ ┌────────┐
│Frame D │ │Frame B │
│(new) │ │(old) │
└────────┘ └────────┘
This is why fork() is so fast in Unix/Linux. Even if the parent has gigabytes of memory, fork() returns quickly because it doesn't copy the data - it only copies the page tables (which are far smaller), with the shared pages marked read-only in both processes. Physical copies only happen when someone writes to a page.
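Copy-on-write also explains the puzzle from the start of this article: two processes, same address, different contents. In this sketch (hypothetical names cow_value and cow_demo, using POSIX fork(), pipe(), and waitpid()), the child writes to a global and reports its address and value back through a pipe. Both processes use the same virtual address, yet the parent's copy is untouched, because the child's write faulted and got its own physical frame.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo global: lives in .data, whose page is shared
   copy-on-write between parent and child after fork(). */
static int cow_value = 42;

/* Returns 1 if the child's write stayed private (same virtual address in
   both processes, child saw 99, parent still sees 42), 0 if not, -1 on error. */
static int cow_demo(void) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                               /* child */
        close(fds[0]);
        cow_value = 99;                           /* write fault -> private copy */
        uintptr_t report[2] = { (uintptr_t)&cow_value, (uintptr_t)cow_value };
        ssize_t n = write(fds[1], report, sizeof report);
        _exit(n == (ssize_t)sizeof report ? 0 : 1);
    }

    close(fds[1]);                                /* parent */
    uintptr_t report[2];
    ssize_t n = read(fds[0], report, sizeof report);
    close(fds[0]);
    int status;
    if (waitpid(pid, &status, 0) != pid || n != (ssize_t)sizeof report)
        return -1;

    return report[0] == (uintptr_t)&cow_value    /* same virtual address... */
        && report[1] == 99                       /* ...child saw its write... */
        && cow_value == 42;                      /* ...parent unaffected */
}
```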
Memory Isn't Always RAM: The Swap Space
Here's the final mind-bender: "memory" doesn't necessarily mean RAM.
When physical RAM is full, the OS can write pages to disk (the swap space or page file) to free up RAM for other things.
Memory Hierarchy:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Virtual Address Space (per process):
↓
Page Tables (mapping):
↓
Physical RAM (actual DRAM chips):
↓
Swap Space (on SSD/HDD):
Example:
Process has 1000 pages allocated
├─ 200 pages in physical RAM (fast)
├─ 50 pages swapped to disk (slow)
└─ 750 pages not yet accessed (not allocated)
When you access a swapped page:
Major Page Fault (Swapped Page):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Program: data = array[1000000];
1. MMU translates virtual address
2. Page table shows: "Present = 0, Swapped = 1"
3. Page fault!
4. CPU jumps to kernel
5. Kernel checks: Where is this page?
6. Reads swap metadata: "On disk, sector 12345"
7. If RAM is full:
a. Pick a victim page (approximating LRU, e.g. with the Clock algorithm)
b. If victim is dirty, write it to swap
c. Free the physical frame
8. Read page from disk → physical frame
(Milliseconds on a hard disk, tens to hundreds of microseconds on an SSD - SLOW!)
9. Update page table: Present = 1, Physical frame = X
10. Flush TLB entry
11. Return to user program
12. Retry instruction, success!
User program has no idea this happened!
Just notices it was slow for a moment.
This is why your computer doesn't crash when you open 50 Chrome tabs with only 8 GB of RAM. The OS is constantly shuffling pages between RAM and disk, making it look like you have more memory than you physically do.
The price: a trip to swap costs roughly 1,000x (SSD) to 100,000x (hard disk) more than a RAM access. If you're swapping heavily ("thrashing"), your computer becomes unusable.
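The Clock algorithm is simple enough to sketch. This toy version assumes 3 frames and uses a per-frame flag as a stand-in for the hardware Accessed bit; struct clock_state and clock_access are hypothetical names, and a real kernel's reclaim logic is far more involved. A page referenced since the hand last passed gets a second chance instead of being evicted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy Clock ("second chance") replacement with 3 frames. The per-frame
   referenced flag stands in for the hardware Accessed bit. */
#define NFRAMES 3

struct clock_state {
    int    page[NFRAMES];
    bool   used[NFRAMES];
    bool   referenced[NFRAMES];
    size_t hand;                     /* next frame the clock hand examines */
};

/* Access `page`; returns true if it caused a page fault. */
static bool clock_access(struct clock_state *c, int page) {
    for (size_t i = 0; i < NFRAMES; i++) {
        if (c->used[i] && c->page[i] == page) {
            c->referenced[i] = true;         /* hit: "hardware" sets Accessed */
            return false;
        }
    }
    for (;;) {                               /* fault: find a victim */
        size_t h = c->hand;
        assert(h < NFRAMES);
        c->hand = (h + 1) % NFRAMES;
        if (!c->used[h] || !c->referenced[h]) {
            c->page[h] = page;               /* evict (swap out) + load */
            c->used[h] = true;
            c->referenced[h] = true;
            return true;
        }
        c->referenced[h] = false;            /* second chance: clear, move on */
    }
}
```

With the reference string 1, 2, 3, 4, 2, 5, page 2 is re-referenced before 5 arrives, so the hand skips it and evicts 3 instead; plain FIFO would have thrown out 2.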
Permission Checking: Why Segfaults Happen
Page table entries include permission bits:
Page Table Entry Flags:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bit Name Meaning
─── ──── ───────
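0 Present Page is in physical RAM (not swapped)
1 R/W Read/Write (1) or Read-Only (0)
2 U/S User (1) or Supervisor/Kernel (0)
3 PWT Page Write-Through (cache control)
4 PCD Page Cache Disable
5 Accessed Page was read or written
6 Dirty Page was written to
7 PS/PAT Page Size in upper-level entries (maps a 2MB/1GB page); memory type (PAT) in a 4KB entry
8 Global Entry survives TLB flushes on CR3 switch (kernel pages)
9-11 Available OS can use these
12-51 PFN Physical Frame Number (width depends on the CPU's physical address size)
52-62 Available OS can use these (some CPUs use 59-62 for protection keys)
63 NX No-Execute (cannot run code from this page)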
When the MMU translates an address, it checks these flags:
Permission Check:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Trying to: Write to address 0x400000 (code section)
1. MMU translates: VA 0x400000 → PA 0x123000
2. Reads page table entry flags:
- Present: 1 (in RAM)
- R/W: 0 (Read-only!)
- U/S: 1 (User accessible)
- NX: 0 (Executable)
3. Operation: WRITE
4. Permission: Read-only
5. MISMATCH! → Page Fault
6. CPU jumps to kernel
7. Kernel sees: invalid write to read-only page
8. Kernel sends SIGSEGV to process
9. Process crashes: "Segmentation fault"
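The check itself is a handful of bit tests. Here is a simplified C model using the bit positions from the table above. It assumes write protection is enforced in kernel mode too (CR0.WP=1, as Linux sets it) and ignores SMEP/SMAP, protection keys, and the fact that real hardware combines permissions from every level of the walk; access_allowed is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag positions from the x86-64 PTE table above. */
#define PTE_P  (1ULL << 0)    /* Present */
#define PTE_RW (1ULL << 1)    /* 1 = writable */
#define PTE_US (1ULL << 2)    /* 1 = user-accessible */
#define PTE_NX (1ULL << 63)   /* 1 = no-execute */

enum access_kind { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC };

/* Simplified MMU permission check for a single (last-level) entry.
   Returns true if the access may proceed, false if it would fault.
   Assumes CR0.WP=1; ignores SMEP/SMAP, protection keys, upper levels. */
static bool access_allowed(uint64_t pte, enum access_kind kind, bool user_mode) {
    if (!(pte & PTE_P))
        return false;                           /* not present */
    if (user_mode && !(pte & PTE_US))
        return false;                           /* supervisor-only page */
    if (kind == ACCESS_WRITE && !(pte & PTE_RW))
        return false;                           /* read-only page */
    if (kind == ACCESS_EXEC && (pte & PTE_NX))
        return false;                           /* no-execute page */
    return true;
}
```

The code page from the walkthrough (Present=1, R/W=0, U/S=1, NX=0) allows reads and instruction fetches but not writes, which is exactly the fault traced above.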
This is why writing to string literals crashes:
char* str = "Hello"; // String literal in read-only memory
str[0] = 'h'; // SEGFAULT! Write to read-only page
Modifying a string literal is undefined behavior in C, and in practice GCC and Clang put "Hello" in the .rodata (read-only data) section, which gets mapped with R/W=0. The MMU catches your write, and the kernel kills your process.
Kernel Space vs User Space
One more critical detail: the virtual address space is divided between user and kernel:
64-bit Virtual Address Space Layout (Linux):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0xFFFFFFFFFFFFFFFF ┐
│ Kernel Space (~128 TB)
│ - Kernel code and data
│ - Page tables
│ - Device drivers
0xFFFF800000000000 ┘ U/S bit = 0 (supervisor only)
0xFFFF7FFFFFFFFFFF ┐
│ Canonical hole
│ (invalid addresses)
0x0000800000000000 ┘
0x00007FFFFFFFFFFF ┐
│ User Space (~128 TB)
│ - Your program
│ - Stack, heap, libraries
0x0000000000000000 ┘ U/S bit = 1 (user accessible)
Key point: The kernel is mapped into the upper half of every process's address space, but with U/S=0. This means:
- User code cannot access kernel memory (the MMU blocks it)
- When you make a system call, the CPU switches to kernel mode and suddenly has access
- The kernel can access your user memory (for read(), write(), etc.)
- Page tables for kernel space are shared across all processes (mapped the same way)
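This is why system calls work without a full address-space switch - the CPU just flips a privilege mode. (One caveat: since the 2018 Meltdown vulnerability, Linux on affected CPUs uses kernel page-table isolation (KPTI), which gives user mode a separate page table with almost none of the kernel mapped, so system calls there do switch page tables.)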
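The canonical hole in the diagram comes from a simple rule: with 48-bit virtual addresses, bits 63 through 47 must all be equal. Here is a small C check; is_canonical_48 is a hypothetical name, and it assumes 4-level paging (5-level paging moves the boundary up to bit 56).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, assuming 4-level paging (48-bit VAs): an address
   is canonical when bits 63..47 are all copies of bit 47, i.e. all zeros
   (user half) or all ones (kernel half). */
static bool is_canonical_48(uint64_t va) {
    uint64_t top = va >> 47;                 /* bits 63..47: 17 bits */
    return top == 0 || top == 0x1FFFF;
}
```

Addresses inside the hole, like 0x0000800000000000, fail the check. On real hardware, using one raises a general-protection fault rather than a page fault, because the address never reaches the page tables at all.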
Putting It All Together: A Memory Access Story
Let's trace a complete memory access from start to finish:
int* ptr = malloc(sizeof(int)); // Allocate memory
*ptr = 42; // Write to it
int value = *ptr; // Read from it
Step 1: malloc()
malloc(sizeof(int)):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. malloc() asks kernel for memory (brk/mmap syscall)
2. Kernel extends the process's virtual address range
(malloc grabs much more than 4 bytes at a time)
- No physical RAM allocated yet!
- Page table entry: Present=0, flags set
3. malloc() returns a virtual address, say 0x7ffd12345000
4. Total time: ~1 microsecond
Step 2: *ptr = 42
Write Operation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. CPU executes: MOV [0x7ffd12345000], 42
2. MMU extracts VPN: 0x7ffd12345
3. Check TLB: Miss (first access to this page)
4. Start page table walk:
- Read PML4[0x0FF] → PDPT address
- Read PDPT[0x1F4] → PD address
- Read PD[0x091] → PT address
- Read PT[0x145] → Entry shows Present=0
5. PAGE FAULT!
6. CPU jumps to kernel page fault handler
7. Kernel sees: valid address, not present
8. Allocate physical page: get frame 0x89abc
9. Zero the page (security - don't leak data)
10. Update PT[0x145]: PPN=0x89abc, Present=1, R/W=1
11. (The kernel doesn't fill the TLB; the retry will)
12. Return to user mode
13. Retry instruction
14. TLB miss again → walk finds Present=1 → TLB filled
Translate to PA 0x89abc000
15. Memory bus write: PA 0x89abc000 ← 42
16. Accessed and Dirty bits set in the page table entry
17. Success!
Total time: on the order of a microsecond (mostly the page fault)
Step 3: value = *ptr
Read Operation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. CPU executes: MOV EAX, [0x7ffd12345000]
2. MMU extracts VPN: 0x7ffd12345
3. Check TLB: HIT! (we just accessed this page)
4. TLB returns: PPN 0x89abc
5. Construct PA: 0x89abc000
6. Memory bus read: PA 0x89abc000 → 42
7. Return value to CPU register
(No page table access at all: the Accessed bit was already set by the first access)
Total time: a few nanoseconds (TLB hit, data already in cache)
Notice the massive speed difference:
- First access with page fault: ~1,000 ns or more
- Second access with TLB hit: a few ns
- That's hundreds to thousands of times faster!
Why This Matters For Programmers
Understanding virtual memory changes how you write code:
1. malloc() Is Cheap, Touching Memory Is Expensive
// This is fast:
char* buf = malloc(1000000000); // 1 GB, instant!
// This is slow:
memset(buf, 0, 1000000000); // Touch every page, page faults!
2. Memory Access Patterns Matter
// Cache-friendly (sequential):
for (int i = 0; i < 1000000; i++) {
array[i] = i; // Few TLB misses
}
// Cache-hostile (random):
for (int i = 0; i < 1000000; i++) {
array[rand() % 1000000] = i; // Many TLB misses
}
3. Page Size Affects Performance
If you're working with huge datasets (databases, scientific computing), using huge pages (2 MB instead of 4 KB) can reduce TLB misses:
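// Allocate with huge pages on Linux. Assumes huge pages were reserved
// (e.g. via /proc/sys/vm/nr_hugepages) and size is a multiple of the huge page size:
void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-1, 0);
if (ptr == MAP_FAILED) { /* no huge pages available: fall back to 4 KB pages */ }
4. Fork Is Fast, Writes After Fork Are Slow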
if (fork() == 0) {
// Child process
// First write to any page → copy-on-write
global_array[0] = 1; // Page fault, copy happens
_exit(0); // Child is done; don't fall through into the parent's code
}
The Beautiful Lie, Revealed
So there it is - the beautiful lie your OS tells every program:
- "You own all the memory!" → No, you share it with everyone, mediated by page tables
- "Addresses are real!" → No, they're virtual, translated by the MMU
- "malloc() gives you memory!" → No, it reserves address space; physical memory comes later
- "Memory is fast!" → Only if it's in RAM; might be on disk (swap)
- "Your addresses are yours alone!" → Yes, but the physical pages behind them may be shared (shared libraries, copy-on-write pages)
This elaborate illusion is what makes modern computing possible. Without virtual memory:
- No memory protection (programs could corrupt each other)
- No swap space (you'd be limited to physical RAM)
- No efficient fork() (copying gigabytes would be slow)
- No shared libraries (every program loads its own copy)
- No ASLR (security feature that randomizes addresses)
Virtual memory is one of those fundamental abstractions - like file systems, or the network stack - that's so good, so essential, that we forget it's even there. Until you try to understand why your program segfaults, or why it suddenly got slow, or why malloc() succeeds even when you're out of RAM.
Now you know the truth behind the illusion.
Going Deeper
If you want to explore this hands-on:
- Watch page faults:
ps -o min_flt,maj_flt -p <pid>
# min_flt = minor faults (no disk)
# maj_flt = major faults (disk I/O)
- See your memory layout:
cat /proc/<pid>/maps # Virtual memory regions
cat /proc/<pid>/smaps # Detailed per-region statistics
- Monitor TLB stats:
perf stat -e dTLB-load-misses,iTLB-load-misses ./your_program
- Force page faults:
char* buf = malloc(1000000);
// No faults yet
memset(buf, 0, 1000000); // Now faults happen
Want to see the kernel side of all this? Look at the Linux kernel source: arch/x86/mm/fault.c (page fault handler) and mm/memory.c (page table operations). It's dense, but now you know what it's doing.
Happy hacking, and remember: every pointer you use is a beautiful lie.
P.S.: If you're wondering about other memory mysteries - like why glibc's malloc() is so complicated, or how huge pages work in practice, or what NUMA means for memory access times - let me know. There's always another layer to peel back.