System Calls | Return Code | Current Offset |
---|---|---|
fd = open("file", O_RDONLY); | 3 | 0 |
read(fd, buffer, 100); | 100 | 100 |
read(fd, buffer, 100); | 100 | 200 |
read(fd, buffer, 100); | 100 | 300 |
read(fd, buffer, 100); | 0 | 300 |
close(fd); | 0 | - |
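
The trace above can be reproduced with a short C sketch. The helper name `trace_reads` and its array parameters are hypothetical, introduced only for illustration; the current offset is queried with `lseek(fd, 0, SEEK_CUR)`, which reports the offset without moving it. The sketch assumes the file already exists.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read `path` 100 bytes at a time. rets[i] receives the
 * return code of the i-th read() and offs[i] the file offset right after it.
 * Stops after a read() that returns 0 (end of file) or -1 (error), or after
 * `max` calls. Returns the number of read() calls made, or -1 if open fails. */
int trace_reads(const char *path, ssize_t rets[], off_t offs[], int max) {
    char buffer[100];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    assert(lseek(fd, 0, SEEK_CUR) == 0);    /* a fresh open starts at offset 0 */

    int n = 0;
    while (n < max) {
        rets[n] = read(fd, buffer, sizeof buffer);
        offs[n] = lseek(fd, 0, SEEK_CUR);   /* query the offset, do not move it */
        if (rets[n++] <= 0)
            break;
    }
    close(fd);
    return n;
}
```

For a 300-byte file this makes four `read()` calls: three that return 100 and a fourth that returns 0 while the offset stays at 300, matching the table.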
System Calls | Return Code | OFT[10] Current Offset | OFT[11] Current Offset |
---|---|---|---|
fd1 = open("file", O_RDONLY); | 3 | 0 | - |
fd2 = open("file", O_RDONLY); | 4 | 0 | 0 |
read(fd1, buffer1, 100); | 100 | 100 | 0 |
read(fd2, buffer2, 100); | 100 | 100 | 100 |
close(fd1); | 0 | - | 100 |
close(fd2); | 0 | - | - |
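
A minimal sketch of the same idea in C: two separate `open()` calls on one file yield two open file table entries, so reading through one descriptor leaves the other's offset untouched. The function name `independent_offsets` is an assumption made for this example, and the file is assumed to exist and hold at least 100 bytes.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open `path` twice, read 100 bytes through fd1 only,
 * then report both descriptors' offsets. Returns 0 on success, -1 on error. */
int independent_offsets(const char *path, off_t *off1, off_t *off2) {
    char buffer1[100];
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);
    if (fd1 < 0 || fd2 < 0) {
        if (fd1 >= 0) close(fd1);
        if (fd2 >= 0) close(fd2);
        return -1;
    }
    assert(fd1 != fd2);                     /* two distinct descriptors */

    ssize_t n = read(fd1, buffer1, sizeof buffer1);
    *off1 = lseek(fd1, 0, SEEK_CUR);        /* advanced by the read */
    *off2 = lseek(fd2, 0, SEEK_CUR);        /* unaffected by fd1's read */

    close(fd1);
    close(fd2);
    return n == 100 ? 0 : -1;
}
```

After the call, `*off1` is 100 and `*off2` is still 0, which corresponds to the third row of the table.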
System Calls | Return Code | Current Offset |
---|---|---|
fd = open("file", O_RDONLY); | 3 | 0 |
lseek(fd, 200, SEEK_SET); | 200 | 200 |
read(fd, buffer, 50); | 50 | 250 |
close(fd); | 0 | - |
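
The `lseek()` trace can be sketched as follows. `read_at` is a hypothetical wrapper written for this example; it relies only on the standard behavior that `lseek()` returns the resulting offset and that `read()` advances the offset by the number of bytes it returns.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: seek to `where`, read up to `count` bytes into buf,
 * and store the offset after the read in *end. Returns read()'s result,
 * or -1 if the file cannot be opened or the seek fails. */
ssize_t read_at(const char *path, off_t where, char *buf, size_t count,
                off_t *end) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t pos = lseek(fd, where, SEEK_SET); /* returns the new offset */
    if (pos == (off_t)-1) {
        close(fd);
        return -1;
    }
    assert(pos == where);

    ssize_t n = read(fd, buf, count);       /* offset advances by n */
    *end = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return n;
}
```

With a 300-byte file, `read_at(path, 200, buf, 50, &end)` returns 50 and leaves `end` at 250, as in the table. Note that `lseek()` only changes a variable in the kernel's open file table; it does not itself cause a disk seek.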

    struct stat {
        dev_t     st_dev;     // ID of device containing file
        ino_t     st_ino;     // inode number
        mode_t    st_mode;    // protection
        nlink_t   st_nlink;   // number of hard links
        uid_t     st_uid;     // user ID of owner
        gid_t     st_gid;     // group ID of owner
        dev_t     st_rdev;    // device ID (if special file)
        off_t     st_size;    // total size, in bytes
        blksize_t st_blksize; // blocksize for filesystem I/O
        blkcnt_t  st_blocks;  // number of blocks allocated
        time_t    st_atime;   // time of last access
        time_t    st_mtime;   // time of last modification
        time_t    st_ctime;   // time of last status change
    };

    struct dirent {
        char           d_name[256]; // filename
        ino_t          d_ino;       // inode number
        off_t          d_off;       // offset to the next dirent
        unsigned short d_reclen;    // length of this record
        unsigned char  d_type;      // type of file
    };

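
As an illustration of how these two structures are typically used together, the sketch below walks a directory with `opendir()`/`readdir()` and calls `stat()` on each entry. The function name `list_dir` and the output format are assumptions made for this example, not part of the listing above.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: print "inode size name" for every entry in `path`
 * except "." and "..". Returns the number of entries seen, or -1 if the
 * directory cannot be opened. */
int list_dir(const char *path) {
    assert(path != NULL);
    DIR *dp = opendir(path);
    if (dp == NULL)
        return -1;

    int count = 0;
    struct dirent *d;
    while ((d = readdir(dp)) != NULL) {
        if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
            continue;

        char full[4096];
        struct stat sb;
        snprintf(full, sizeof full, "%s/%s", path, d->d_name);
        if (stat(full, &sb) == 0)           /* per-file metadata lives in the inode */
            printf("%lu %lld %s\n", (unsigned long)d->d_ino,
                   (long long)sb.st_size, d->d_name);
        count++;
    }
    closedir(dp);
    return count;
}
```

The directory entry supplies only the name and inode number; everything else (size, owner, link count) comes from `stat()`, which is why a long listing needs one extra call per file.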

    prompt> echo hello > file
    prompt> stat file
    ... Inode: 67158084 Links: 1 ...
    prompt> ln file file2
    prompt> stat file
    ... Inode: 67158084 Links: 2 ...
    prompt> stat file2
    ... Inode: 67158084 Links: 2 ...
    prompt> ln file2 file3
    prompt> stat file
    ... Inode: 67158084 Links: 3 ...
    prompt> rm file
    prompt> stat file2
    ... Inode: 67158084 Links: 2 ...
    prompt> rm file2
    prompt> stat file3
    ... Inode: 67158084 Links: 1 ...
    prompt> rm file3

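
The same link-count behavior can be observed from C with `link()`, `unlink()`, and `stat()`. This is a minimal sketch: `link_count` and `link_demo` are hypothetical names, and the sketch assumes a file system that supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Number of hard links to `path`, or -1 if stat() fails. */
long link_count(const char *path) {
    struct stat sb;
    return stat(path, &sb) == 0 ? (long)sb.st_nlink : -1;
}

/* Hypothetical demo: create `file`, hard-link it as `file2`, remove `file`.
 * counts[0..2] receive the link count observed after each step. Both names
 * are removed before returning. Returns 0 on success, -1 on error. */
int link_demo(const char *file, const char *file2, long counts[3]) {
    unlink(file2);                          /* ignore error: may not exist */
    int fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    close(fd);
    counts[0] = link_count(file);           /* like: echo hello > file */

    if (link(file, file2) != 0) {           /* like: ln file file2 */
        unlink(file);
        return -1;
    }
    struct stat a, b;
    if (stat(file, &a) != 0 || stat(file2, &b) != 0)
        return -1;
    assert(a.st_ino == b.st_ino);           /* two names, one inode */
    counts[1] = link_count(file);

    unlink(file);                           /* like: rm file */
    counts[2] = link_count(file2);          /* data survives under file2 */
    unlink(file2);
    return 0;
}
```

`unlink()` removes a name and decrements the count; the inode and its data are freed only when the count reaches zero.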

    drwxr-x---  2 remzi remzi   29 May 3 19:10 ./
    drwxr-x--- 27 remzi remzi 4096 May 3 15:14 ../
    -rw-r-----  1 remzi remzi    6 May 3 19:10 file
    lrwxrwxrwx  1 remzi remzi    4 May 3 19:10 file2 -> file

Size | Name | What is this inode field for? |
---|---|---|
2 | mode | can this file be read/written/executed? |
2 | uid | who owns this file? |
4 | size | how many bytes are in this file? |
4 | time | what time was this file last accessed? |
4 | ctime | what time was this file created? |
4 | mtime | what time was this file last modified? |
4 | dtime | what time was this inode deleted? |
2 | gid | which group does this file belong to? |
2 | links_count | how many hard links are there to this file? |
4 | blocks | how many blocks have been allocated to this file? |
4 | flags | how should ext2 use this inode? |
4 | osd1 | an OS-dependent field |
60 | block | a set of disk pointers (15 total) |
4 | generation | file version (used by NFS) |
4 | file_acl | a new permissions model beyond mode bits |
4 | dir_acl | called access control lists |
Most files are small | ~2K is the most common size |
Average file size is growing | Almost 200K is the average |
Most bytes are stored in large files | A few big files use most of space |
File systems contain lots of files | Almost 100K on average |
File systems are roughly half full | Even as disks grow, file systems remain ~50% full |
Directories are typically small | Many have few entries; most have 20 or fewer |
inum | reclen | strlen | name |
---|---|---|---|
5 | 12 | 2 | . |
2 | 12 | 3 | .. |
12 | 12 | 4 | foo |
13 | 12 | 4 | bar |
24 | 36 | 28 | foob: |
Operation | data bitmap | inode bitmap | root inode | foo inode | bar inode | root data | foo data | bar data[0] | bar data[1] | bar data[2] |
---|---|---|---|---|---|---|---|---|---|---|
open(bar) | | | read | read | read | read | read | | | |
read() | | | | | read, write | | | read | | |
read() | | | | | read, write | | | | read | |
read() | | | | | read, write | | | | | read |
Operation | data bitmap | inode bitmap | root inode | foo inode | bar inode | root data | foo data | bar data[0] | bar data[1] | bar data[2] |
---|---|---|---|---|---|---|---|---|---|---|
create(/foo/bar) | | read, write | read | read, write | read, write | read | read, write | | | |
write() | read, write | | | | read, write | | | write | | |
write() | read, write | | | | read, write | | | | write | |
write() | read, write | | | | read, write | | | | | write |
ib db | Inodes | Data |
Super | Group 0 | Group 1 | Group N |
---|
Super | Journal | Group 0 | Group 1 | ... | Group N |
Journal | TxB id=1 | I[v2] | B[v2] | ?? | TxE id=1 |

Journal | Tx1 | Tx2 | Tx3 | Tx4 | Tx5 |

Journal | TxB id=1 | I[foo] ptr:1000 | D[foo] [final addr:1000] | TxE id=1 |

Journal | TxB id=1 | I[foo] ptr:1000 | D[foo] [final addr:1000] | TxE id=1 | TxB id=2 | I[bar] ptr:1000 | TxE id=2 |
Page 0 | Page 1 | Page 2 | Page 3 |
---|---|---|---|
00011000 | 11001110 | 00000001 | 00111111 |
VALID | VALID | VALID | VALID |
Page 0 | Page 1 | Page 2 | Page 3 |
---|---|---|---|
11111111 | 11111111 | 11111111 | 11111111 |
ERASED | ERASED | ERASED | ERASED |
Page 0 | Page 1 | Page 2 | Page 3 |
---|---|---|---|
00000011 | 11111111 | 11111111 | 11111111 |
VALID | ERASED | ERASED | ERASED |
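
The three snapshots above (valid pages, after an erase, after programming page 0) can be modeled with a small C sketch. The type and function names here are hypothetical, and the model is deliberately tiny: one block of four one-byte pages, as in the tables.

```c
#include <assert.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 4

enum page_state { INVALID, ERASED, VALID };

/* Toy model of one flash block: one byte of data per page. */
struct flash_block {
    uint8_t data[PAGES_PER_BLOCK];
    enum page_state state[PAGES_PER_BLOCK];
};

/* Erase works on the whole block: every bit becomes 1, every page ERASED. */
void flash_erase(struct flash_block *b) {
    for (int p = 0; p < PAGES_PER_BLOCK; p++) {
        b->data[p] = 0xFF;
        b->state[p] = ERASED;
    }
}

/* Program a single page. Only an ERASED page may be programmed; returns 0
 * on success, -1 if the containing block must be erased first. */
int flash_program(struct flash_block *b, int page, uint8_t value) {
    assert(page >= 0 && page < PAGES_PER_BLOCK);
    if (b->state[page] != ERASED)
        return -1;
    b->data[page] = value;
    b->state[page] = VALID;
    return 0;
}

/* Read a single page. Only VALID pages hold meaningful data. */
int flash_read(const struct flash_block *b, int page, uint8_t *out) {
    assert(page >= 0 && page < PAGES_PER_BLOCK);
    if (b->state[page] != VALID)
        return -1;
    *out = b->data[page];
    return 0;
}
```

The key asymmetry the model captures is granularity: programming targets one page, but the erase that must precede it wipes the whole block, so any live data in the other pages has to be copied elsewhere first.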
Device | Read (μs) | Program (μs) | Erase (μs) |
---|---|---|---|
SLC | 25 | 200-300 | 1500-2000 |
MLC | 50 | 600-900 | ~3000 |
TLC | ~75 | ~900-1350 | ~4500 |
Table: | 100 | 101 | | | | | | | | | | | Memory |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Block: | 0 | | | | 1 | | | | 2 | | | | Flash Chip |
Page: | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | |
Content: | a1 | a2 | b1 | b2 | c1 | c2 | | | | | | | |
State: | V | V | V | V | V | V | E | E | i | i | i | i | |
Table: | 100→4 | 101→5 | 2000→6 | 2001→7 | Memory |
---|---|---|---|---|---|
Block: | 0 | | | | Flash Chip |
Page: | 00 | 01 | 02 | 03 | |
Content: | | | | | |
State: | E | E | E | E | |
Log Table: Data Table: | | | | | | | | | | | | | Memory |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Block: | 0 | | | | 1 | | | | 2 | | | | Flash Chip |
Page: | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | |
Content: | a' | b' | c' | d' | | | | | a | b | c | d | |
State: | V | V | V | V | i | i | i | i | V | V | V | V | |
Device | Random Reads (MB/s) | Random Writes (MB/s) | Sequential Reads (MB/s) | Sequential Writes (MB/s) |
---|---|---|---|---|
Samsung 840 Pro SSD | 103 | 287 | 421 | 384 |
Seagate 600 SSD | 84 | 252 | 424 | 374 |
Intel SSD 335 SSD | 39 | 222 | 344 | 354 |
Seagate Savvio 15K.3 HDD | 2 | 2 | 223 | 223 |
0011 | 0110 | 0101 | 1110 | 1100 | 0100 | 1100 | 1101 |
1011 | 1010 | 0001 | 0100 | 1000 | 1010 | 1001 | 0010 |
1110 | 1100 | 1110 | 1111 | 0010 | 1100 | 0011 | 1010 |
0100 | 0000 | 1011 | 1110 | 1111 | 0110 | 0110 | 0110 |
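
The table lays out 16 bytes as four rows of 4 bytes each. Assuming, as the row-per-chunk layout suggests, that the rows are meant to be combined column by column with XOR to form a 4-byte checksum, a minimal sketch looks like this. The names `xor_checksum` and `example_block` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR-fold `len` bytes (len must be a multiple of 4) into one 4-byte value:
 * byte i of the result is the XOR of input bytes i, i+4, i+8, ... */
uint32_t xor_checksum(const uint8_t *data, size_t len) {
    assert(len % 4 == 0);
    uint8_t sum[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < len; i++)
        sum[i % 4] ^= data[i];
    return ((uint32_t)sum[0] << 24) | ((uint32_t)sum[1] << 16) |
           ((uint32_t)sum[2] << 8) | (uint32_t)sum[3];
}

/* The 16 bytes from the table above, one table row per line. */
static const uint8_t example_block[16] = {
    0x36, 0x5e, 0xc4, 0xcd,
    0xba, 0x14, 0x8a, 0x92,
    0xec, 0xef, 0x2c, 0x3a,
    0x40, 0xbe, 0xf6, 0x66,
};
```

For these bytes, `xor_checksum(example_block, 16)` yields `0x201b9403`. XOR is cheap, but it misses any pair of bit flips that land in the same column position of two different rows, since they cancel out.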
D0 | D1 | D2 | D3 | D4 | D5 | D6 |
C[D0] | D0 | C[D1] | D1 | C[D2] | D2 | C[D3] | D3 | C[D4] | D4 |

C[D0] C[D1] C[D2] C[D3] C[D4] | D0 | D1 | D2 | D3 | D4 |

Disk 1 | C[D0] | D0 | C[D1] | D1 | C[D2] | D2 |
---|---|---|---|---|---|---|
Disk 0 | C[D0] | D0 | C[D1] | D1 | C[D2] | D2 |
Procedure | Arguments | Returns |
---|---|---|
NFSPROC_GETATTR | file handle | attributes |
NFSPROC_SETATTR | file handle, attributes | attributes |
NFSPROC_LOOKUP | directory file handle, name of file/dir to look up | file handle, attributes |
NFSPROC_READ | file handle, offset, count | data, attributes |
NFSPROC_WRITE | file handle, offset, count, data | attributes |
NFSPROC_CREATE | directory file handle, name of file, attributes | file handle, attributes |
NFSPROC_REMOVE | directory file handle, name of file to be removed | |
NFSPROC_MKDIR | directory file handle, name of directory, attributes | file handle, attributes |
NFSPROC_RMDIR | directory file handle, name of directory to be removed | |
NFSPROC_READDIR | directory handle, count of bytes to read, cookie | directory entries, cookie (to get more entries) |
Client | Server |
---|---|
fd = open("/foo", ...); Send LOOKUP (rootdir FH, "foo") | |
| | Receive LOOKUP request; look for "foo" in root dir; return foo's FH + attributes |
Receive LOOKUP reply; allocate file desc in open file table; store foo's FH in table; store current file position (0); return file descriptor to application | |
read(fd, buffer, MAX); Index into open file table with fd; get NFS file handle (FH); use current file position as offset; Send READ (FH, offset=0, count=MAX) | |
| | Receive READ request; use FH to get volume/inode num; read inode from disk (or cache); compute block location (using offset); read data from disk (or cache); return data to client |
Receive READ reply; update file position (+bytes read); set current file position = MAX; return data/error code to app | |
close(fd); Just need to clean up local structures; Free descriptor "fd" in open file table; (No need to talk to server) | |

Figure 49.5: Reading A File: Client-side And File Server Actions |
Client1 | Client1 Cache | Client2 | Client2 Cache | Server Disk | Comments |
---|---|---|---|---|---|
open(F) | - | | - | - | File created |
write(A) | A | | - | - | |
close() | A | | - | A | |
open(F) | A | | - | A | |
| | A | | - | A | |
close() | A | | - | A | |
open(F) | A | | - | A | |
write(B) | B | | - | A | |
open(F) | B | | - | A | Local processes |
| | B | | - | A | see writes immediately |
close() | B | | - | A | |
| | B | open(F) | A | A | Remote processes do not see writes... |
| | B | | A | A | |
| | B | close() | A | A | |
close() | B | | A | B | ... until close() has taken place |
| | B | open(F) | B | B | |
| | B | | B | B | |
| | B | close() | B | B | |
| | B | open(F) | B | B | |
open(F) | B | | B | B | |
write(D) | D | | B | B | |
| | D | write(C) | C | B | |
| | D | close() | C | C | |
close() | D | | C | D | |
| | D | open(F) | D | D | Unfortunately for |
| | D | | D | D | |
| | D | close() | D | D | |
Workload | NFS | AFS | AFS/NFS |
---|---|---|---|
1. Small file, sequential read | | | 1 |
2. Small file, sequential re-read | | | 1 |
3. Medium file, sequential read | | | 1 |
4. Medium file, sequential re-read | | | 1 |
5. Large file, sequential read | | | 1 |
6. Large file, sequential re-read | | | |
7. Large file, single read | | | |
8. Small file, sequential write | | | 1 |
9. Large file, sequential write | | | 1 |
10. Large file, sequential overwrite | | | 2 |
11. Large file, single write | | | |
Process | Hardware | Operating System |
---|---|---|
1. Execute instructions | | |
2. System call: Trap to OS | | |
| | 3. Switch to kernel mode; Jump to trap handler | |
| | | 4. In kernel mode; Handle system call; Return from trap |
| | 5. Switch to user mode; Return to user code | |
6. Resume execution (@PC after trap) | | |
Process | Operating System |
---|---|
1. System call: Trap to OS | |
| | 2. OS trap handler: Decode trap and execute appropriate syscall routine; When done: return from trap |
3. Resume execution (@PC after trap) | |
Process | Operating System | VMM |
---|---|---|
1. System call: Trap to OS | | |
| | | 2. Process trapped: Call OS trap handler (at reduced privilege) |
| | 3. OS trap handler: Decode trap and execute syscall; When done: issue return-from-trap | |
| | | 4. OS tried return from trap: Do real return from trap |
Process | Operating System |
---|---|
1. Load from memory: TLB miss: Trap | |
| | 2. OS TLB miss handler: Extract VPN from VA; Do page table lookup; If present and valid: get PFN, update TLB; Return from trap |
3. Resume execution (@PC of trapping instruction); Instruction is retried; Results in TLB hit | |
t | Con1 | Con2 | Prod | Waiting on full | FE | Comment |
---|---|---|---|---|---|---|
0 | C0 | | | | 0 | |
1 | C1 | | | Con1 | 0 | Con1 waiting on full |
2 | | | | Con1 | 0 | switch: Con1 to Prod |
3 | | | P0 | Con1 | 0 | |
4 | | | P2 | Con1 | 0 | Prod doesn't wait (FE=0) |
5 | | | P3 | Con1 | 0 | |
6 | | | P4 | Con1 | 1 | Prod updates fullEntries |
7 | | | P5 | | 1 | Prod signals: Con1 now ready |
8 | | | | | 1 | switch: Prod to Con2 |
9 | | C0 | | | 1 | switch to Con2 |
10 | | C2 | | | 1 | Con2 doesn't wait (FE=1) |
11 | | C3 | | | 1 | |
12 | | C4 | | | 0 | Con2 changes fullEntries |
13 | | C5 | | | 0 | Con2 signals empty (no waiter) |
14 | | C6 | | | 0 | Con2 done |
15 | | | | | 0 | switch: Con2 to Con1 |
16 | C0 | | | | 0 | recheck fullEntries: 0! |
17 | C1 | | | Con1 | 0 | wait on full again |