Possibly inaccurate. Possibly incomplete. My notes may have errors or be incomplete.
You will almost certainly find better notes than mine somewhere on the internet. Maybe you should be using those instead of mine.
Maybe the best way to use these notes is just to reference important concepts and see how they tie into each other, and the video lectures.
The course is designed more from an academic's point of view, than a software developer's point of view. It can hence lean more towards theory than practice. I haven't done the assignments.
fork() is the primary way of creating new process in linux, Creates a child process with cloned program, execution flow, and the whole environment, but assigns a new pid, and both processes run at same time (they're "logically" running simultaneously even if not "physically", OS can share the same physical core for them by taking turns). after calling fork(), check if pid is non-zero to identify if you are parent or child now.
wait() lets parent process process exit code of child after it called exit()
exec() also creates new process but it terminates the parent process
default fd: 0 stdin 1 stdout 2 stderr, rest are user assignable
/proc/{pid} is pseudofilesystem in RAM that lets processes store information, or OS to provide information to process
threads share address space but have distinct execution flow
kernel-level threads - OS is aware, gives physical concurrency
user-level threads - OS is unaware, user manages scheduling, context switching is cheaper (no OS syscall needed), scheduling policy can be designed custom to a specific application, user-level thread-based code can run on machines that don't support kernel-level multithreading
process leak - child process calls exit() but parent process has not called wait() to get the exit code, hence OS will keep the exit code of child in memory (potentially for months)
orphaned processes - if parent calls exit() before child calls exit(), child now has init process as parent and init will repeatedly call wait() to cleanup all child processes
At hardware level, hardware bus interfaces between CPU, disk, RAM, etc
RAM has a memory management unit (MMU) that does some memory accounting, address translation, access control etc
x86 8086 architecture
16-bit registers, but can also use half a register for 8-bit values
Ax, BX, DX, CX - 4 standard 16-bit registers
SP, BP, SI, DI - 4 standard 16-bit registers
PC - program control 16-bit register - stores reference to next instruction to run
FLAGS - register to track floating point overflows, positive/negative values etc, system-level information - cannot be written directly
segment registers - add 4-bits more to PC, so that total 20-bits can be used to address 2^20 memory locations
later extended to 32-bit architecture with backwards compatibility (for existing OSes and compilers written on older hardware)
EAX, EBX, EDX, ECX - 4 standard 32-bit registers
AX, BX, DX, CX refers to lower 16-bits of these registers
today 64-bit architectures are the norm as RAM grew bigger than 4 GB (2^32 bytes) eventually
AT&T syntax => C-ish equivalent
movl %eax, %edx => edx=eax; # store contents of eax
movl $0x123, %edx => edx=123; # store 123 as a constant value
movl 0x123, %edx => edx=(int)123; # treat 123 as an address, dereference it, store contents
movl (%ebx), %edx => edx=(int)ebx # treat contents of ebx as an address, dereference it, store contents
movl 4(%ebx), %edx => edx=(int)(ebx+4) # treat contents of ebx as address, add 4 bytes offset to get another address, dereference it, store contents
"l" indicates 32-bit or 64-bit, this gets converted to 0x66 or 0x67 or nothing as needed when written in binary
push/pop contents of stack. %esp is stack pointer.
pushl %eax
popl %eax => movl (%esp), %eax; add $4, %esp
call $0x12345 => pushl $eip; jmp 0x12345 # push incremented instruction pointer (/program control) to stack, jump to instruction stored at specified address
ret => popl %eip; # function return i.e. pop previously stored instruction pointer from stack and use it to continue the program
[These are fake instructions for explanation purpose, %eip is not accessible in real life]
GCC calling convention:
at function entry (call), %eip points to 1st instruction in function, %esp points to return address, (%esp+4) points to first argument to function
at function exit (ret), %eip points to return address, %esp points to arguments pushed by call, %eax contains return value
function can mutate or trash %eax, %edx, %ecx (caller-saved registers)
function cannot mutate %ebx, %ebp, %esi, %edi (callee-saved registers)
ebp points to esp position at start of function call, esp is now free to be used for running the contens of the function
ebp can be used to get function arguments anywhere in between the function run, also ebp is need after exiting the function
function entry: push %ebp; mov %esp, %ebp
function exit: mov %ebp %esp, popl %ebp, ret
ebp is not strictly needed because compiler can remember how many items the function run has added to the stack, so it can calculate ebp from esp at any time. but it is convenient
GCC compiler workflow:
.c files --compiler--> .s assembly files
--assembler--> .o object files (binary)
--linker--> one single binary file
--loader--> starts execution of this binary
I/O space
1024 adresses
only allows IN/OUT instructions
Physical address space
Includes low memory, BIOS, VGA memory, extended memory, memory-mapped devices
Most of the memory is extended memory
Memory-mapped devices belongs to connected devices (such as a printer) and instructions run on those registers help communicate with that device
memory translation
a process can individually use addresses 0,1,2,3, etc
MMU translates it to base + va, assuming base + va < limit, limit is assigned for each process
then va gets translate to pa, pa = va + segment * 16
the executable of the process can assume it starts from address 0 and ends at limit, MMU will do the address translation needed
Global descriptor table (GDT) has 2^13 entries. GDT stores base and limit values. A subset of these are accessible to each process. process can set its selectors AS, BS, DS, CS, FS, GS to any of the allowed values in the GDT.
privilege levels
last 2 bits of CS register determine privilege level (aka ring level), how much of GDT is accessible. 00 and 11 are commonly used privilege levels. usually OS processes are 11, and rest are 00
ljmp $selector, $offset - sets cs (memory selector) and eip (instruction pointer), used to read from GDT
GDTR register stores start address of GDT, so the OS knows where it is. IDTR stores start of IDT (see below)
Important: in practice, selectors end up being cached in CPU so don't need to access them from RAM via bus everytime
interrupt descriptor table
interrupt descriptor table (IDT) contains entries with cs (selector) and eip (instruction pointer)
processes can set interrupts in CPU, when an interrupt is triggered the IDT gets read to know where to jump execution to
commonly used for:
timers
external device sends interrupt
CPU raises exception, such as div by zero or segfault
system calls, such as read or write to a device - needs privilege escalation hence interrupt is used
the process handling the interrupt commonly sends a signal (SIG_STOP, SIG_INT, etc) to the original process. the original process then has default or user-defined signal handler that does some work
on interrupt, CS and EIP change, SS and ESP might change, otherwise the address space remains same as the original process
whenever privilege escalation occurs - for example an interrupt from user program calls OS-defined handler - then copy SSO and ESP0 to SS and ESP respectively, so you avoid using SS and ESP of the user program (which are untrusted), then push the old values to the new stack (kernel stack)
two stacks - user stack, kernel stack. user stack is untrusted
old CS, old EIP, flags stored in stack on interrupt
iret command is used after interrupt handling is done, to pop the stack values to CS and EIP
problems with just doing memory segmentation as discussed so far:
memory fragmentation is possible - malloc(size) and free(ptr) called multiple times can leave holes in the memory
process growth can cause fragmentation - memory required by process can grow at runtime, leading to fragmentation
multiple segments is harded to program - programming model (which compiler must be aware of) becomes complicated if multiple segments can be accessed by the same process using multiple segment selectors
Therefore: Paging
f maps virtual page nos to physical page nos
p_addr = f(v_addr // 2^p) >> p + v_addr % 2^p
typical page size = 2^12 bytes = 4 KB
typical virtual memory size = 2^32 bytes = 4 GB (approx)
=> typical number of pages = 2^20
=> depth=2 hierarchy to map virtual to physical pages => 2^10 entries in page directory, 2^10 entries in each page table
=> 2^12 bytes are covered by a single entry in page table
segmentation is no longer used in 64-bit architecture, paging is the primary method
segmentation is hardware-based, paging is software-based
privilege levels
when doing paging, user processes and kernel share same address space. typically addresses above 3 GB (in a 4 GB virtual memory) are reserved for kernel. EIP is checked, not CS.
page directory and page tables have a flag indicating privilege level needed to access given v_addr
demand paging (optimisation)
pages are created in advice, in case they're needed
copy-on-write (optimisation)
share pages instead of copying them, but keep them read-only
if either processes attempts to write, make the copy and overwrite the copy
Translation lookaside buffer (TLB)
TLB stores most commonly accessed entries in page tables so you don't need to go through page tables (2 lookups) everytime
TLB typically has hit ratio >99% in most computer usage
Large pages
Usually pages are small (4 KB), but you can also assign a large page to a process (4 MB) if you know the process won't waste most of it. (Recovering this wasted space means OS must handle memory fragmentation better. Paging was introduced to avoid memory fragmentation.)
Kernel is assigned large pages, for example.
Large pages reduce number of times TLB (or page tables) must be accessed
Boot sector
When machine is powered on, hardware loads first 512 bytes from disk sector 0 to memory address 7C00
get segment and offset at which each of them must be loaded, and get address and load them
if memory size exceeds file size, store zero in rest of the memory
get entry point from elf header, and use it. (This is the kernel)
this will never return, assuming OS has no bug
after compilation: Kernel is executable in elf format, so it will run now
$ objdump -p kernel # get elf program headers
LOAD: vaddr 0x80100000, paddr 0x100000, align 2^12, filesz 0xb596, memsz 0x126fc
# ignore first 1 MB of physical memory which is for console
# load these many bytes from disk to memory in 2^12 size page table
$ objdump -f kernel # get elf entry address from elf header
0x10000c
# first instruction to execute
Quickly skimming through next few lecture notes
Possibly important points:
Security rules for dereferencing user pointers - user processes should not be allowed to access certain parts of address space. How to check some rules without degrading performance.
Concurency, locking
Concurrency notes
If two threads must concurrently read or write the same variable, you may need to "lock" one variable while one thread is finished doing all its steps involving that variable. Otherwise the logic of both threads working together can cause logical error. For example two threads simultaneously doing x = x + 1 with initial x = 0 may both output 1 instead of the correct answer 2.
This problem occurs the same whether a thread is switched out for another thread by the OS (for any reason) on a single CPU processor, or if multiple CPU processors try to access the variable at the same time.
Problem for both user (unprivileged) threads and kernel (privileged) threads
Any variable can be locked by a thread. Another thread must check whether that variable has been locked by a thread before accessing it.
There can be race conditions and hence non-determinstic order in which thread is acquired. This is correct and maximises performance. Forcing determninistic order of acquiring the lock would reduce performance.
Two processes (in multiprocessor system) cannot acquire a variable at same instant in time because of "atomic read and write" instruction provided by hardware. This is called "exchange" instruction. For a given variable, there's an extra bit indicating if it's locked or not. A process can read and write this bit atomically in a single instruction. Hence two processes won't read the bit at the same time, and then race to who writes first.
Spin lock - whichever thread fails to get the lock keeps spinning in endless loop until it finally gets the variable in unlocked state. This is often less wasteful than calling "yield" to release the processor and pause that thread. Yield requires switching address space etc which is more CPU instructions than just spinning (a few instructions of that threads per say thousand instructions of the other thread).
On single processor system - yield is always better. Until context is switched to other thread, lock will never release. Spinning is wasteful
On multiprocessor system - optimal is spin for some number of iterations, and then yield if lock is not released yet
Fine-grained locking - instead of having one global lock for each variable, it could be given a fine-grained lock. This records the id (PID?) of source thread and destination thread, which are supposed to lock and acquire the lock respectively.
Problem: cyclic dependency is possible if not careful. Two threads are both waiting for each other to release a lock on two separate variables.
One solution: Followed by most OSes. There's a serial order in which all locks are arranged, and a thread can only acquire a lock with greater serial position than the previously acquired locks by other threads. (Any thread can release any lock anytime.)
Good discipline for programmers (as per prof) - first decide how fine-grained your locking must be - global locks, locks per data structure and if yes which data structure. Then decide a serial order for the locks and enforce that only a bigger position can be acquired. This is not the only way to guarantee code correctness but it is one common way.
(Disk-based) Databases can do much better concurrency control than memory, because disk accesses are much slower than memory accesses and hence one can do a spend a lot more cycles on concurrency-related stuff in parallel with the disk access.
Ensuring atomicity of acquiring a lock, if implemented naively, forces memory access to go across the bus instead of just the memory cache. Cache coherence protocol avoids this.
Compiler optimisation of code involving locking can be a problem. Compiler can for example reorder two separate read intructions where the compiler is unaware that the order matters, or avoid doing a write if it notices the same variable is being read (and then rewritten) soon after being written. You can use "volatile" keyword to switch off compiler optimisation on code fragments (in gcc right??)
Hardware optimisation is even bigger problem than compiler optimisation. Hardware can also reorder multiple memory accesses that it thinks are independent. Chip performance is significantly increased by this, as cache hit-dependent code is processed before cache miss-dependent code. Important example: Intel. You can use "fence" instructions to disable out-of-order memory access by hardware on code fragments.
while accessing the lock table itself, on multiprocessor system, all other threads need to be spin locked (i.e. in infinite loop waiting for a lock to be released).
recursive locks are a bad idea, don't allow a function to acquire a lock twice or maintain a counter for how many times it was acquired.
try lock - good idea - attempt to acquire a lock, if fails, don't spin and wait for it, instead do something else
Disk
time taken to fetch disk sector
rotational latency - rotate disk to reach useful track
seek time - travel along useful track to reach disk sector
random access and serial access have different times taken
possible to store data smartly to reduce seek time (within a disk track)
Example: increase throughput of static web server hosted on multicore machine
need multiple threads (1 o 2 per core) that can take requests from the network input queue and respond to them
need locks for input queue
need "conditional variables", if all threads are waiting on the CPU, then continue a thread only once another thread completes its CPU task
need in-memory cache for the disk data
this cache needs locks. typically there is one lock on the entire cache, and a second level of locks - one per page.
need to ensure atomicity in all these above operations - this code is difficult to write correctly
Useful concurrency abstractions
locks - one thread can lock a data structure
conditional variables - one thread can be paused until some binary flag is changed
semaphore - conditional variable but using integer counter instead of binary, a thread can be paused until counter hits some value
Conditional variables
wait and notify functions
A thread can call wait and it'll remain paused until another thread calls notify on it
Semaphore
conditional variable but integer instead of binary
any thread can atomically decrement counter (but not below zero) or increment it (but not above specified max value of that counter)
a thread can pause until the counter hits some value. (Usually wait until it is below its max value, i.e., ensure some queue is no longer full)
Good practices on when to use sempahores versus locks:
https://barrgroup.com/blog/mutexes-and-semaphores-demystified
Some concurrency notes
Interchangeability
Semaphore with counter max=1 can implement lock
Locks + conditional variables can implement semaphore
Monitor
abstraction inside a programming language that makes compiler aware of the concurrency-based instructions you're using (like locks, semaphores etc), otherwise it is possible to use them without making the compiler aware you're doing this
Naive approach: lock variables, operate on them, store result, unlock variables
Transactional approach: make copies of variables, operate on the copies, and then keep trying to store final result in original variables until success
If probability that another thread touches the data at the same time is low, then transactional approach can run faster than locking at the start
Potential problem: live lock - another thread changes the input variables while this thread is running, now it can no longer write to the output variable as itll break atomicity. it may have to re-execute the entire code on new input variables, which is costly. Transactional approach works best when live lock probability is very low.
Transactions are useful for disk access, because disk access is slow and the additional CPU work due to occasional live lock may not be significant compared to disk seek time.
When?
Writing correct transactional code is easier than writing correct lock-based code, programmer time is saved
Database developers prefer transactional approach (because of disk access time and code correctness being more important), operating system developers prefer lock-based approach (because performance is most important)
Intel Haswell supports doing transactional approach on memory access using hardware rather than software, for increased performance.
Compare and swap instruction (aka CAS)
atomically do: if R_current=R_old, R_current=R_new
operating systems support this transactional instruction
allows multiple threads to share a piece of memory, and atomically write to it without being messed up by other writes
common usage pattern: copy old value first, do something useful with it, then when you do compare and swap you're ensuring that the old value has not been changed by some other thread first
Reader-writer lock
Special lock that separates read-acquire from write-acquire
Useful when there are a lot of threads that only read but don't write
A shared variable can be acquired using write-lock only one thread at a time, but it can be acquired using read-lock by multiple threads at the same time. Also it can be acquired by a read-lock and a write-lock at the same time. (A read and write at same time can conflict, and two writes at same time can conflict.)
xv6 implementation of concurrency and multiprocess communication
locks - implemented
conditional variables - implemented with small differences - sleep/wakeup instead of wait/notify
semaphores - not implemented
signals - not implemented
parent:fork+wait, child:exit - trivially implemented using sleep/wakeup
initlock - name, cpu for debugging only, locked indicates state of lock, initialised to zero
acquire - disables interrupts, then use xchg to atomically set lock state to 1, then output which process and cpu called the lock for debugging purpose, interrupts remain disabled even after acquire returns
pushcli() instead of cli() because pushcli() counts how many times cli has been called, useful as xv6 supports nested locks
interrupts must remain disabled so that interrupt handler does not access shared data, it'll remain disabled until lock is released
release - set cpu back to zero, use xchg to release lock, then popcli to enable interrupts
xchg used to ensure instructions are not reordered by compiler or hardware
global invariant across OS - once ptable.lock is acquired, no other lock can be acquired. (see earlier notes on why order of lock acquisition is important to prevent deadlock, i.e., two locks circularly waiting on each other) both acquire and release functions themselves (implemented by kernel) need to acquire ptable.lock to ensure their own atomicity
xv6 implementation of sleep/wakeup discussed in detail, did not make detailed notes
in short, sleep and wakeup on channels which are arbitrary integers. One thread calls sleep, another one is started on wakeup
no need to assign a 1-bit location for the variable separately. Storing this integer itself (versus not storing it) is enough to indicate state.
xv6 parent and child process
OS calls "sleep" on PID when parent spawns a child process, when child exits OS calls "wakeup" on the same PID so the parent process wakes up again
xv6 kill
implemented in proc.c
https://github.com/mit-pdos/xv6-public/blob/master/proc.c
example of a place where killed bit is checked (this example is inside trap)
https://github.com/mit-pdos/xv6-public/blob/master/trap.c
linux uses signals to implement kill, however xv6 does not implement signals
goes through proc table, finds entry and sets killed bit to 1
if process is already sleeping, then mark it as RUNNABLE. so it is eventually started and then it is killed.
that process will then respond to this entry, and exit.
timer interrupt is used to periodically force all processes into kernel mode and check if they are killed or not.
killed bit is checked when entering kernel mode or when exiting kernel mode. checking at this time (not other times) ensures no other work is interrupted, and consistency of data structures is maintained
killed bit also checked before user process is ever going to sleep.
after exiting, its parent will free up memory
if a parent is sleeping (because its children are running) and then kill is called on its pid, then it will first be woken up (despite sleep condition not matching) and killed bit will be set, and then after that it'll check its killed bit and call exit
IDE Disk driver
https://github.com/mit-pdos/xv6-public/blob/master/ide.c
disk driver - when read() or write() system call is made, disk driver code actually interfaces with the hardware
interrupt-based
idequeue protected by idelock - all syscalls are first stored in the queue. queue is processed FIFO by disk driver.
buffer cache - all read/write data is stored in disk cache shared across all threads and implemented by kernel. repeated reads can be satisfied from it.
iderw(): acquire idelock, append buffer to end of idequeue, if this is only element in idequeue then start disk, wait for request to finish, release idelock
ideintr(): acquire idelock, if idequeue is null then return, read data, move to next position in queue, wakeup() call used to coordinate between disk and CPU, if idequeue still non-empty then restart disk device for next buffer, release idelock
interrupt handlers are disabled while inside these functions, because lock is being held
Demand paging
when a program is started, don't need to load entire program from disk to memory
load pages on demand, as needed. pages are left pointing to disk blocks, and can retrieve from disk as needed
page fault occurs if memory access is attempted, then instruction is restarted
precise exceptions - need to ensure if memory access fails, then no changes have been made and it is okay to restart the instruction
demand paging allows loading of programs larger than size of RAM
swap space on disk is often used to store extra data on disk, with pages pointing to it
in practice hit rates are often >99.9% which makes this worth doing
commonly used: 1 block = 1 page
"Fully associative cache"
"Write back" cache - every write shouldnt necessarily hit the disk
Optimal page replacement policy - replace page not used for longest time in the future. This is not possible to do unless the programmer tells you which page this is.
LRU policy - replace page not used for longest time in the past. This is a heuristic that tries to approximate policy above.
Global replacement
One attack possible - a process can touch a lot of pages and reduce performance of other processes
Per-page policy
If a process is hitting many page faults, then reduce quota for that process
Thrashing
most page accesses are page faults and hence accessing the disk. This significantly slows down the machine.
OS can split processes into groups and run only some at a time, such that memory sum of active processes at any point in time is smaller than RAM
this does not work if one process is single handedly causing thrashing, only options are to give it a smaller fraction of compute time, or kill it
Handle thrashing used elapsed time
Working set = set of pages accessed in last T seconds
time elapsed of each page tracked, it is only updated on missed path not hit path, hence disk access time is greater and time to compute time elapsed in not large
Page fault frequency
Find page fault frequency of each process and allocate fractions of memory based on it
Thrashing was a bigger problem when computers were shared. On personal computers, user can manually control which processes to run or not run.
mmap (memory mapped files)
just as extra memory pages can be stored on disk, files from disk can also be stored in memory
flush - explicitly force latest contents to be written to disk
Samuel:
lmao its caches all the way down:
L1 cache caches a subset of RAM
RAM caches a subset of disk
disk caches a subset of all disks on Earth
all disks on Earth cache a subset of all human-produced info
I'm proposing:
for all non-video data on Earth, no need for caching and just put everything in RAM
Storage devices
magnetic disk
typical as of 2018:
1-5 platters, 1-8 inches radius, 200k circular tracks per radial inch, ?? sectors per track, 512 bytes per sector, 100 GB - 2 TB total capacity
5k-15k rpm, seek time 2-10 ms, avg rotational latency 4 ms (assuming 7500 rpm), 100-150 MB/s transfer rate
direct memory access: CPU can ask a large number of sectors to be moved from disk to memory without that data all passing through the CPU's disk registers
flash
no moving parts, shock-resistant
as of 2018:
5-10x more expensive than magnetic disk
16-256 KB per block, 512B-4KB per page, 16 GB total capacity
updating a page is not possible, need to update an entire block at once
wear-out - after 100k writes on a block, not reliable
magnetic tape
commonly used for backup till 1990s, much cheaper than disks, not as popular now
typical:
9 bits (1 byte + parity bit) tape cross-section, 0.5 inch cross-section, 2400 feet length, total transfer rate ~10 MB/s
reading any address requires reading entire tape from start till that address
DIsk controller - order of serving I/O requests
FIFO
problem: this does not take advantage of the fact that some regions are close to each other and hence could be done together
SSTS - shortest seek time first
problem: starvation - some request may stay in queue for extremely long because disk controller remains on a different region on disk
Scan algorithm
pick direction of radial movement (towards or away from centre) based on FIFO request, but once picked it'll keep that direction and only do SSTS along the same direction
Database versus filesystem
Database bypasses filesystem
Database knows more about the structure of the data, and can optimise seek and rotation based on the knowledge of data structure it has
Filesystem assumes no information about data structure, uses heuristics that work well for all data structures
Filesystem
Most UNIX filesystem optimisations assume magnetic disk
INODE represents on-disk file
user can't access INODE, can inspect using stat(fd), otherwise user works with file descriptor not INODE - read(fd) and write(fd)
implicit offset: read(fd, buf, size) - fd contains file reference and offset within the file
fork() means child process inherits fd and hence automatically inherits file offsets
if both processes write to same fd, then they automatically append to each other, as they share the offset, one process writing also shifts the offsets
Durability semantics
when write() called, flush to disk
when close() called, flush to disk - used for network filesystems sometimes
within T seconds, flush to disk - highest performance, used by linux
user application can call fsync() to force premature flushing to disk
need to ensure filesystem does not get corrupted if power is switched off at any time
Allocation of disk sectors to files
platonic ideal: contiguous organisation
each file is stored in contiguous region in disk, INODE points to first block of file
sequential access is fast
problem: disk fragmentation, appending to file is very slow if file end hits the start of another file
platonic ideal: linked list organisation
INODE points to first block of file, each block points to next block
avoids disk fragmentation problem
problem: random access is slow, need to follow a sequence of multiple pointers to get a specified offset in a file
File Allocation Table (FAT)
each INODE points to start of linked list, but store all linked lists in memory
FAT32: 2^28 entries * 2 KB block size = 512 GB
Indexed filesystem
Each INODE stores an index of block numbers
indexes can be multi-level. in a level-n index, some entries can point to another level-(n-1) index. typically upto 3-levels of indexing. larger files will use more levels of indexing.
In BSD UNIX: 14 direct entries, 1 entry to level-1 index, 1 entry to level-2 index. depending on file size, last and second last entry may or may not be used.
xv6 filesystem
xv6 blocks
block 0: boot sector
block 1: superblock
block 2 to n+1 (inclusive): inodes
block n+2 to n+m+1 (inclusive): free blocks bitmap - stores 1 bit for each block indicating if its free or occupied
block n+m+2 to end (inclusive): files
xv6 inode
type, nlink, size, addresses[12+1]
12 direct entries, 1 entry to block with 128 direct entries => (12 + 128) * 512B = 70 KB maximum file size (lmao)
directory: special file that stores directory and file entries in it
all blocks first go to buffer cache, before being read by CPU or memory
locking needed as multiple threads share the same disk, their operations should not collide
every block in buffer has a separate lock - BUSY bit
use global spinlock to protect all the fine-grained blocking locks
xv6 lock ordering to avoid deadlocks:
superblock < everything else < free blocks bitmap
inode < data
parent < child
Buffer cache replacement policy
maintain buffer cache as a double-linked list
order of list is order of last access time of underlying data blocks
maintaining this list takes less time that accessing underlying data blocks, hence it is feasible
xv6 filesystem optimisations that don't exist in it
write-through cache - BAD
all writes must hit disk, not just buffer cache
no prefetching - BAD
multiple reads/writes cannot be batched, disk will rotate more as a result
double copying - BAD
data is first copied to buffer cache, then copied to address referenced by user pointer. can't pass user pointer to disk controller directly
Disk buffer cache size
RAM acts as cache both for virtual memory and filesystem, needs to be split between both
Filesystem durability on system crash
maintain filesytem integrity if an operation gets partially completed and power to machine is cut (RAM gets wiped)
old solution: ordering
force order of write operations such that dangling pointer does not happen but disk block leak can happen
i.e. if A points to B, first write B then write A. interruption means B alone allocated is possible but A alone allocated is not possible
on next reboot, fsck can check entire filesystem for inconsistencies. this was fine for older disks but slow for today's disks (100 MB/s for 1 TB disk is slow)
old solution: write-back cache + ordering:
write-back cache (write to cache, only write to disk later) is much faster than write-through (all writes hit disk immediately). we can combine this with ordering, by also storing the order in the cache
whenever a new operation causes order different from existing ordering in disk buffer cache, then first flush the disk buffer before trying to execute new instruction
modern solution: logging
maintain a log on disk of all operations, write "commit" to log as last write after all other reads/writes for that filesystem operation are completed, invariant states that operations have only completed if "commit" was logged
problem: 2x disk writes per filesystem operation
Linux ext3 filesystem solves this
ext3 filesystem
Concepts
combines many filesystem operations into one transaction
all of them get committed to disk log together, let's say every 30 seconds
multiple writes to same blocks get batched, hence no 2x disk writes for them
one transaction at a time
file structure (INODEs etc) may be inconsistent, {file structure+log} as combined system is never consistent, log is ground truth and can be used to undo changes in file structure
log is a circular buffer with head pointer stored in as offset in log superblock, and tail pointer inferrable using sequence numbers
all operations in one transaction must finish before operations in next transaction are started. this delay is non-trivial but acceptable.
Blocks
log superblock - start offset, sequence number
sequence number is a counter of number of log entries made so far
descriptor blocks - block number, sequence number, magic number
data blocks
commit blocks - sequence number, magic number
Recovery algo
from superblock: get offset, sequence number
scan starting at offset till first record with incorrect sequence number or incorrect magic number
then scan backward till first commit record
this range of blocks has not been completed successfully. complete them and commit to log
Modes
Ordered mode - data blocks not logged, only metadata logged
Journaled mode - data blocks also logged
Performance
much higher than xv6 filesytem, because all data blocks in tx are written sequentially to log, and then next disk revolution all commit blocks in tx are written sequentially to log. assuming magnetic disk, sequential is much faster than random access
ext4 filesystem
Solves problem: what if disk has an internal cache, and the disk cannot honestly report whether a write has actually hit the magntiec platter or only just the disk cache. ext3 does not solve this problem
Also stores checksum (of what???)
Ensures consistency of descriptor and commit blocks stored in log, doesn't ensure consistency of data blocks stored in log
fsync() syscall
forces disk buffers to be written to disk
ordered mode: commits latest tx, then syscall returns
data might not actually be on disk yet, if power crashes, it will be written to disk on next power on from the log
journaled mode: commits latest tx, also writes data blocks to log, then syscall returns
Operating System Security
authentication (who is request an action?), authorization (what action can they do?), access enforcement (actually allowing/blocking the action)
Authentication
typically either passwords (hashed list of passwords stored) or trusted devices
Authorization
typically a matrix of objects and users, for each combination of object and user there's a set of permissions allowed
typically entire matrix is not stored at same place
access control list: each object is stored with list of users and permissions of each user. organise users into groups to shorten list length
"capabilities": ACLs make namespace public. "capabilities" can be used to make namespaces private, only authorized user or process is aware of name of resource
UNIX filesystem ACLs
users: owner, group, others
permissions: read, write, execute
ACL: 9 bits per file - indicates exactly one of three users above and one of three permissions above
also: setuid bit per file - indicates whether file can be executed with filesystem privileges of owner not the user that is executing it. set to 0 by default (owner's filesystem privileges not granted to executor)
if setuid is 1, then process has two uids (uid and ruid) and can switch between them at will
Process scheduling
Objectives: Maximise throughput, minimise response time, no process starvation
Platonic ideals of scheduling CPU time
FIFO
Shortest job first
Round-robin
Every scheduling interval, cycle to next process out of n active processes
10 ms interval is commonly used as this is faster than human response time
Proportional round-robin
CPU time assigned to job proportional to job's CPU requirement
Proportional round-robin using lottery
Every time interval, pick random job with probability weights based on job's CPU requirement
Priority scheduling
If multiple processes are ready, assign CPU to job with higher priority rank
problem: priority inversion: higher priority process might be waiting for a lock from a lower priority process, some intermediate process might run indefinitely and the lock may never be released
solution: priority donation: increase priority of the process that has lock to that of the process waiting for it
In practice, multi-level feedback queue is common when scheduling CPU time
Multiple priority levels, round robin among processes at each priority level
Typically I/O-heavy processes are given higher priority, compute-heavy processes are given lower priority
Heuristic assumes processes that were previously I/O-heavy or compute-heavy will remain so in future
Typically higher priority I/O-heavy processes are given smaller time interval
Processes that are not touched for long duration have their priority automatically increased - prevents starvation
In practice, affinity scheduling on multi-core machine
Each CPU core has a separate cache, once a cache is hot in one core, prefer processes to run on the same core where their cache is warm
In practice, gang schedules when inter-process communication exists
Execute both processes together as a gang
But OS may not always know which processes are communicating
Example where Unix knows: two processes communicating using an OS pipe
Example where Unix does not know: process is in spinlock
developer can only partially safeguard against this by ensuring critical section inside spinlock takes less time, less syscalls, ensure less chances of page faults or interrupts
Example where Unix does not know: two processes are in producer-consumer relationship in user space
In general: Resources to be scheduled across multiple processes, not just CPU time
Multiple CPU cores
CPU time
Memory
Memory cache
Disk
Network
Scheduling was more important when computers were shared between users, today they're less important, in future with cloud computing again scheduling is becoming important
PAUSED AT END OF LECTURE 37
AI generated notes on some xv6 memory management lecture
o BOOTING AND INITIAL PHYSICAL MAPPING
When the system is first powered on, it starts in 16-bit real mode.
o Physical memory is directly accessed with no translation if an address is below the available RAM (M).
o If an address exceeds M, the hardware raises an error.
In xv6 (the teaching OS), segmentation is initially set up trivially (segment registers set to zero base and maximum limit) so that the entire address range (0 to M) is still effectively flat.
o SWITCHING TO 32-BIT MODE WITH SEGMENTATION
The system transitions to 32-bit mode, enabling segmentation.
Although segmentation is "on," xv6 sets segment base = 0 and limit = 2^32 - 1 for a flat model:
o A virtual address X still effectively refers to physical address X (for addresses below M).
o Makes it easier to move on to the next step, paging.
o BOOT SECTOR ROLE
The boot sector (first 512 bytes on disk) loads the kernel into memory.
o It knows how many sectors of the kernel to read and places them in physical memory starting around 1 MB.
o Still no paging at this point, just a flat, segment-based address space from 0 to M.
The kernel is compiled such that its symbols have virtual addresses above 2 GB (the "kernel base").
o Accessing these addresses is not yet valid in the current address space (0 to M), so the kernel cannot use them before paging is set up.
o ENABLING PAGING (ENTRY.S)
After the kernel is loaded at ~1 MB, control transfers to entry.s (the first assembly instructions).
o Before paging is turned on, the code must not dereference kernel symbols (which reside above 2 GB).
entry.s sets up a preliminary page directory ("entry page directory") that:
o Maps 0-4 MB of virtual addresses to 0-4 MB of physical memory (identity map).
o Also maps kern-base (2 GB) to (2 GB + 4 MB) onto the same 0-4 MB physical region.
o This ensures two virtual address ranges--one low (0-4 MB) and one high (2 GB to 2 GB+4 MB)--point to the same physical space.
Paging is enabled (CR3 is loaded with the physical address of the entry page directory, and the paging bit is set).
o The kernel code at 1 MB still runs without crashing, because 1 MB is inside the identity-mapped 0-4 MB.
o Once paging is on, addresses in the high region (above 2 GB) also map to the same physical memory.
o TRANSITIONING THE STACK AND EXECUTION TO "HIGH" ADDRESSES
After paging is enabled, the assembly in entry.s carefully sets up ESP to point to a stack symbol (stack + kstack_size).
o That stack symbol has a virtual address above 2 GB.
o Because paging is now active, that virtual address is correctly mapped to physical memory.
The code then jumps to main(), which also has a virtual address above 2 GB.
o Execution proceeds entirely in this higher half of the address space.
o The kernel base is thus fully utilized, and kernel symbols are all valid addresses now.
o MAIN FUNCTION AND THE KERNEL PAGE TABLES
main() is the first C function running in the fully paged environment.
It calls kinit1() to initialize a physical page allocator (kalloc / kfree).
o This manages physical memory pages that the kernel can request.
Then main() calls kvmalloc() to create a more complete kernel page directory:
o Replaces the simple "entry page directory" (0-4 MB + 2 GB-2 GB+4 MB mapping).
o Maps the entire physical memory (0->M) into the high kernel address range (2 GB->2 GB+M).
o Also includes device space and other necessary mappings.
o MAPPING THE ENTIRE PHYSICAL ADDRESS SPACE
The new page directory will:
o Map 2 GB-(2 GB + M) to 0-M of physical memory.
o Free up the low memory region (0-4 MB) so that it can eventually be used for user processes.
o Reserve special areas (like the top region near 0xFE000000) for device mappings.
o KVMALLOC AND SETUPKVM
kvmalloc() does two main jobs:
setupkvm(): Allocates a new page directory and sets up the kernel mappings (code, data, device regions).
Zeros out the page directory to clear out any existing entries.
Fills it through map_pages() calls, which create small-page mappings (4 KB granularity) for each region.
Mappings include:
? Kernel code (read/execute) and data (read/write).
? All free physical memory as a potential kernel heap.
? Device space (e.g., FE000000-FFFFFFFF) for memory-mapped I/O.
switchkvm(): Switches CR3 to this new page directory (involves translating the new page directory's virtual pointer to a physical address).
The kernel now runs with a fully mapped higher-half view of all physical memory.
o KALLOC, KFREE, AND FURTHER MEMORY USE
kalloc allocates one physical page at a time from the free list (the region above the kernel code), returning a virtual address in kernel space.
If the kernel or user processes need more memory:
o The kernel can create new mappings by pairing a newly allocated physical page with the desired virtual address range.
o The user's virtual pages are separate from the kernel, but they physically come from the same free pool.
o The kernel also always retains a mapping of the entire physical space at or above 2 GB.
o NOTES ON PAGE SIZES AND OVERHEAD
xv6 uses 4 KB pages for all mappings, including large contiguous areas.
Some OSes optimize by using large pages (e.g., 4 MB pages) for kernel space, but xv6 keeps it simple.
o MEMORY-MAPPED DEVICES
Certain physical addresses (e.g., the bottom 1 MB, the top region around 0xFE000000) are reserved for devices rather than RAM.
The kernel sets up identical mappings (VA = PA) in that top region so it can easily access device registers.
o WRAPPING UP
The lecture walks through how xv6 gradually transitions from real-mode flat memory to segmentation-based 32-bit flat addressing, and finally into a fully paged environment where:
o The kernel resides and operates at high addresses (above 2 GB).
o All physical memory is visible to the kernel in that high range.
o The low virtual memory range can be repurposed for user processes, each in their own page tables.
This setup is typical of many x86 OS designs, although details vary (e.g., the use of large pages, or different base addresses for the kernel).
livelock
CPU doing lot of work but no useful work, hence system is stuck
example: assume a queue or buffer with producer and consumer, if producer is much faster than consumer, the buffer will get completely full, new data to queue will get dropped, however producer process may still be hogging CPU and not letting consumer work.
Specific example: 1 gbps network input but CPU can process only 100 Mbps at a time (also we are not writing the data to disk)
Solution implemented in linux kernel - if buffer size is high, switch from interrupt-mode for producer to polling-mode and select an appropriate polling frequency. this ensures not all requests from producer are responded to immediately.
cache coherence protocol
hardware implementation to ensure each CPU cache can store value of a memory address in read mode or read-write mode ("exclusive mode"). multiple CPUs can cache same address in read mode but only one CPU can cache it in exclusive mode.
locks can end up stored in CPU caches however if multiple CPUs try to write the same lock then program gets slowed down to the speed of memory bus, not CPU cache
Lock-free sync using compare-and-swap
You can do compare-and-swap on writer threads and use no lock on reader threads
compare-and-swap is hardware implementation that atomically does swap only if a value has not been changed
Non-preemptibility in Linux
In multithreaded process, one thread might be aware that a certain memory location is now garbage and free() is to be called. However other threads might not be aware of this yet, and might also be holding a pointer to the same location.
Linux provides an invariant such that a thread running in kernel mode can choose to be non-preemptible within a critical section of code, so it can ensure it completes its operations before other thread is context switched to.
PAUSED AT LECTURE 39 16:52
Subscribe
Enter email or phone number to subscribe. You will receive atmost one update per month