The file system is one of the most important subsystems of an operating system. The Virtual File System (VFS), as the name suggests, gives application programmers a layer of abstraction that hides the differences between the various underlying file systems. Linux's file system layer is designed in an object-oriented style, which makes it very easy to extend: adding support for a new file system to Linux is straightforward.
Linux itself is written mostly in C (with a little assembly), and C is a classic procedural language rather than an object-oriented one, so why do we say the Linux file system layer uses an object-oriented design? (This shows that object-oriented design has little to do with object-oriented languages: plenty of people write structured code in an object-oriented language like Java, while others produce elegant object-oriented designs in a procedural language like C.) We will see how later.
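The idiom is worth seeing in isolation before diving into the VFS. Below is a minimal, stand-alone sketch (not kernel code; all names are made up) of the pattern the VFS uses everywhere: an "object" is a struct, its behaviour lives in a table of function pointers, and each concrete implementation supplies its own table.

/* A struct of function pointers acts as the "interface"; a concrete
 * implementation provides its own table, just like s_op / i_op / f_op. */
#include <stdio.h>

struct shape;                               /* forward declaration */

struct shape_operations {                   /* the "interface" */
    double (*area)(const struct shape *self);
};

struct shape {                              /* the "base object" */
    const struct shape_operations *ops;
    double width, height;
};

static double rect_area(const struct shape *self)
{
    return self->width * self->height;
}

static const struct shape_operations rect_ops = { .area = rect_area };

int main(void)
{
    struct shape r = { .ops = &rect_ops, .width = 3, .height = 4 };
    /* the caller only knows the interface, not the concrete implementation */
    printf("area = %f\n", r.ops->area(&r));
    return 0;
}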
File system types
The file system types supported by the VFS fall into three broad categories:
- Disk-based file systems (Ext2, Ext3, etc.)
- Network file systems (NFS, etc.)
- Special file systems (proc, sysfs, etc.)
Linux directories form a tree whose root is /. The root directory lives in the root file system, which on Linux is typically Ext2 or Ext3; all other file systems are mounted on subdirectories of the root file system.
The common file model
The VFS owes its strong extensibility precisely to this common file model: to plug into Linux, a newly supported file system only has to map its own structures onto the common model. Let's start with the main object structures in the VFS:
super_block
Defined in <linux/fs.h>:
struct super_block {
struct list_head s_list; /* Keep this first */
dev_t s_dev; /* search index; _not_ kdev_t */
unsigned char s_blocksize_bits;
unsigned long s_blocksize;
loff_t s_maxbytes; /* Max file size */
struct file_system_type *s_type;
const struct super_operations *s_op;
const struct dquot_operations *dq_op;
const struct quotactl_ops *s_qcop;
const struct export_operations *s_export_op;
unsigned long s_flags;
unsigned long s_magic;
struct dentry *s_root;
struct rw_semaphore s_umount;
int s_count;
atomic_t s_active;
#ifdef CONFIG_SECURITY
void *s_security;
#endif
const struct xattr_handler **s_xattr;
struct list_head s_inodes; /* all inodes */
struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */
#ifdef CONFIG_SMP
struct list_head __percpu *s_files;
#else
struct list_head s_files;
#endif
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */
/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
spinlock_t s_inode_lru_lock ____cacheline_aligned_in_smp;
struct list_head s_inode_lru; /* unused inode lru */
int s_nr_inodes_unused; /* # of inodes on lru */
struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct hlist_node s_instances;
struct quota_info s_dquot; /* Diskquota specific options */
struct sb_writers s_writers;
char s_id[32]; /* Informational name */
u8 s_uuid[16]; /* UUID */
void *s_fs_info; /* Filesystem private info */
unsigned int s_max_links;
fmode_t s_mode;
/* Granularity of c/m/atime in ns.
Cannot be worse than a second */
u32 s_time_gran;
/*
* The next field is for VFS *only*. No filesystems have any business
* even looking at it. You had been warned.
*/
struct mutex s_vfs_rename_mutex; /* Kludge */
/*
* Filesystem subtype. If non-empty the filesystem type field
* in /proc/mounts will be "type.subtype"
*/
char *s_subtype;
/*
* Saved mount options for lazy filesystems using
* generic_show_options()
*/
char __rcu *s_options;
const struct dentry_operations *s_d_op; /* default d_op for dentries */
/*
* Saved pool identifier for cleancache (-1 means none)
*/
int cleancache_poolid;
struct shrinker s_shrink; /* per-sb shrinker handle */
/* Number of inodes with nlink == 0 but still referenced */
atomic_long_t s_remove_count;
/* Being remounted read-only */
int s_readonly_remount;
};
A super_block stores information about a mounted file system; for a disk-based file system it usually corresponds to the filesystem control block stored on disk. All super_block objects are kept on a circular doubly linked list whose head is super_blocks (declared in <linux/fs.h>); the struct list_head s_list field links a super block to its neighbours. s_fs_info points to the concrete file system's own superblock information, which is normally kept on disk; for Ext2, for instance, it points to an ext2_sb_info.
Notice the s_op pointer in this structure. It points to the superblock operations supplied by the concrete file system: when a new file system is registered, it provides these operations itself. This is how object-oriented design is done in C (it works, though not as naturally as in a true object-oriented language). Here is struct super_operations:
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*dirty_inode) (struct inode *, int flags);
int (*write_inode) (struct inode *, struct writeback_control *wbc);
int (*drop_inode) (struct inode *);
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_fs) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct dentry *);
int (*show_devname)(struct seq_file *, struct dentry *);
int (*show_path)(struct seq_file *, struct dentry *);
int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
int (*nr_cached_objects)(struct super_block *);
void (*free_cached_objects)(struct super_block *, int);
};
For example, to invoke the alloc_inode method:
sb->s_op->alloc_inode(sb);
The difference from an object-oriented language is that there an instance method can reach this, and through it all of the object's members; C cannot do that implicitly, so the object has to be passed into the function explicitly as an argument.
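To make the pattern concrete, here is a simplified sketch of how a file system might supply its own super_operations; the myfs_* names are invented for illustration, and a real implementation (see fs/ext2/super.c) uses a dedicated slab cache and fills in many more operations.

struct myfs_inode_info {
    unsigned long my_private;        /* fs-specific per-inode data */
    struct inode vfs_inode;          /* the VFS inode embedded inside */
};

static struct inode *myfs_alloc_inode(struct super_block *sb)
{
    struct myfs_inode_info *mi = kmalloc(sizeof(*mi), GFP_KERNEL);

    if (!mi)
        return NULL;
    return &mi->vfs_inode;           /* hand the embedded VFS inode back */
}

static const struct super_operations myfs_super_ops = {
    .alloc_inode = myfs_alloc_inode,
    /* .destroy_inode, .write_inode, .statfs, ... would be filled in here */
};

/* During mount the file system wires its table into the super block:
 *     sb->s_op = &myfs_super_ops;
 * after which sb->s_op->alloc_inode(sb) lands in myfs_alloc_inode(). */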
inode
Defined in <linux/fs.h>:
struct inode {
umode_t i_mode;
unsigned short i_opflags;
kuid_t i_uid;
kgid_t i_gid;
unsigned int i_flags;
#ifdef CONFIG_FS_POSIX_ACL
struct posix_acl *i_acl;
struct posix_acl *i_default_acl;
#endif
const struct inode_operations *i_op;
struct super_block *i_sb;
struct address_space *i_mapping;
#ifdef CONFIG_SECURITY
void *i_security;
#endif
/* Stat data, not accessed from path walking */
unsigned long i_ino;
/*
* Filesystems may only read i_nlink directly. They shall use the
* following functions for modification:
*
* (set|clear|inc|drop)_nlink
* inode_(inc|dec)_link_count
*/
union {
const unsigned int i_nlink;
unsigned int __i_nlink;
};
dev_t i_rdev;
loff_t i_size;
struct timespec i_atime;
struct timespec i_mtime;
struct timespec i_ctime;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
unsigned int i_blkbits;
blkcnt_t i_blocks;
#ifdef __NEED_I_SIZE_ORDERED
seqcount_t i_size_seqcount;
#endif
/* Misc */
unsigned long i_state;
struct mutex i_mutex;
unsigned long dirtied_when; /* jiffies of first dirtying */
struct hlist_node i_hash;
struct list_head i_wb_list; /* backing dev IO list */
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
union {
struct hlist_head i_dentry;
struct rcu_head i_rcu;
};
u64 i_version;
atomic_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
struct file_lock *i_flock;
struct address_space i_data;
#ifdef CONFIG_QUOTA
struct dquot *i_dquot[MAXQUOTAS];
#endif
struct list_head i_devices;
union {
struct pipe_inode_info *i_pipe;
struct block_device *i_bdev;
struct cdev *i_cdev;
};
__u32 i_generation;
#ifdef CONFIG_FSNOTIFY
__u32 i_fsnotify_mask; /* all events this inode cares about */
struct hlist_head i_fsnotify_marks;
#endif
#ifdef CONFIG_IMA
atomic_t i_readcount; /* struct files open RO */
#endif
void *i_private; /* fs or device private pointer */
};
inode存儲的是特定文件的信息。對於基於磁盤的文件系統,這個通常對應着存儲在磁盤上的文件控制塊(file control block)。每個inode對應着一個inode編號,用來唯一標識文件系統中的文件。需要注意的是在Linux中文件夾也是一個文件,所以它也有一個inode對應着。
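The inode number is easy to observe from user space through stat(2), whose st_ino field reports it; a small illustration (the path is just an example):

/* Every file, including a directory, has an inode number. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;
    const char *path = argc > 1 ? argv[1] : ".";

    if (stat(path, &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("%s: inode %lu on device %lu\n",
           path, (unsigned long)st.st_ino, (unsigned long)st.st_dev);
    return 0;
}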
Notice that, just as super_block has s_op, this structure has an i_op pointer. It points to the following structure:
struct inode_operations {
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
void * (*follow_link) (struct dentry *, struct nameidata *);
int (*permission) (struct inode *, int);
struct posix_acl * (*get_acl)(struct inode *, int);
int (*readlink) (struct dentry *, char __user *,int);
void (*put_link) (struct dentry *, struct nameidata *, void *);
int (*create) (struct inode *,struct dentry *, umode_t, bool);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
void (*truncate) (struct inode *);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
int (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
} ____cacheline_aligned;
This table holds the operations behind familiar file commands such as rename and mkdir: the corresponding system calls are ultimately dispatched to these functions.
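As a rough illustration of that dispatch, here is a simplified sketch of what the mkdir(2) path boils down to (loosely modelled on vfs_mkdir() in fs/namei.c; permission checks, security hooks and most error handling are omitted, and the _sketch name is invented):

int vfs_mkdir_sketch(struct inode *dir, struct dentry *dentry, umode_t mode)
{
    /* the file system must actually implement the operation */
    if (!dir->i_op->mkdir)
        return -EPERM;

    /* dispatch to the concrete file system's implementation,
     * e.g. ext2_mkdir() for an Ext2 directory */
    return dir->i_op->mkdir(dir, dentry, mode);
}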
file
Defined in <linux/fs.h>:
struct file {
/*
* fu_list becomes invalid after file_free is called and queued via
* fu_rcuhead for RCU freeing
*/
union {
struct list_head fu_list;
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path;
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt
const struct file_operations *f_op;
/*
* Protects f_ep_links, f_flags, f_pos vs i_size in lseek SEEK_CUR.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
#ifdef CONFIG_SMP
int f_sb_list_cpu;
#endif
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
loff_t f_pos;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
u64 f_version;
#ifdef CONFIG_SECURITY
void *f_security;
#endif
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
#ifdef CONFIG_DEBUG_WRITECOUNT
unsigned long f_mnt_write_state;
#endif
};
A file object represents a file opened by a process, so unlike the two previous structures it belongs to a process rather than to the file system. Its most important field is the current position, f_pos: the process's next operation on the file starts at this offset. file objects sit in the process's file descriptor table, and the file descriptor we pass to functions such as read is simply an index into that table; internally the kernel uses the index to find the corresponding file object and then calls the appropriate function from its file_operations.
The task_struct structure (the process descriptor) contains a struct files_struct *files field:
struct files_struct {
/*
* read mostly part
*/
atomic_t count;
struct fdtable __rcu *fdt;
struct fdtable fdtab;
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
int next_fd;
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
fdtable:
struct fdtable {
unsigned int max_fds;
struct file __rcu **fd; /* current fd array */
unsigned long *close_on_exec;
unsigned long *open_fds;
struct rcu_head rcu;
struct fdtable *next;
};
The fd field of fdtable points to the array of files the process has open; initially it points to the fd_array embedded in files_struct. fd_array holds NR_OPEN_DEFAULT entries, which defaults to BITS_PER_LONG, i.e. 32 on a 32-bit machine. The strategy is: as long as the process has at most 32 files open, fd points at fd_array and the file pointers all live there; once the number of open files exceeds 32, the kernel allocates a larger array and updates the fdtable's fd pointer and max_fds accordingly. The file descriptor used in our programs is simply an index into this array.
Note that several file descriptors may refer to the same file object (after dup() or fork(), for example).
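Putting the pieces together, resolving a descriptor to a struct file is essentially an array lookup; the sketch below is loosely modelled on fcheck_files() in <linux/fdtable.h>, with the RCU locking omitted and the function name invented:

struct file *fd_to_file_sketch(struct files_struct *files, unsigned int fd)
{
    struct fdtable *fdt = files->fdt;   /* real code uses rcu_dereference() */

    if (fd >= fdt->max_fds)
        return NULL;                    /* not a valid descriptor */
    return fdt->fd[fd];                 /* index straight into the fd array */
}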
In addition, task_struct is also linked to an fs_struct structure:
struct fs_struct {
int users;
spinlock_t lock;
seqcount_t seq;
int umask;
int in_exec;
struct path root, pwd;
};
struct path {
struct vfsmount *mnt;
struct dentry *dentry;
};
This structure has a couple of interesting fields: root and pwd. root is the process's root directory and pwd is its current working directory. Remember the pwd command in Linux? Many people wonder why it has that name, which sounds as if it ought to have something to do with passwords. Here is the connection: pwd stands for "print working directory", and the working directory it prints is exactly this pwd field.
What are root and pwd for? When your program accesses a file, a path that begins with / is resolved starting from root, while a path that does not begin with / is resolved starting from pwd.
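From user space the distinction looks like this (the file names below are only examples):

/* The same open(2) call resolves an absolute path from the process's
 * root and a relative path from its current working directory. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* resolved from fs->root, i.e. normally from "/" */
    int fd_abs = open("/etc/hostname", O_RDONLY);

    /* resolved from fs->pwd, the current working directory */
    int fd_rel = open("notes.txt", O_RDONLY);

    printf("absolute: %d, relative: %d\n", fd_abs, fd_rel);
    if (fd_abs >= 0) close(fd_abs);
    if (fd_rel >= 0) close(fd_rel);
    return 0;
}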
file_operations
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
};
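A file object's f_op pointer refers to one of these tables; regular file systems and device drivers each supply their own. As a rough illustration, here is a minimal sketch of how a hypothetical character driver might fill one in (the mychar_* names are invented): when user space calls read(2) on the device file, the VFS ends up in the .read function registered here.

static ssize_t mychar_read(struct file *filp, char __user *buf,
                           size_t count, loff_t *ppos)
{
    return 0;                       /* pretend the device is empty (EOF) */
}

static int mychar_open(struct inode *inode, struct file *filp)
{
    return 0;                       /* nothing to set up in this sketch */
}

static const struct file_operations mychar_fops = {
    .owner = THIS_MODULE,
    .open  = mychar_open,
    .read  = mychar_read,
    /* .write, .unlocked_ioctl, ... would be added as needed */
};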
dentry
Defined in <linux/dcache.h>:
struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
unsigned int d_count; /* protected by d_lock */
spinlock_t d_lock; /* per dentry lock */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
*/
union {
struct list_head d_child; /* child of parent list */
struct rcu_head d_rcu;
} d_u;
struct list_head d_subdirs; /* our children */
struct hlist_node d_alias; /* inode alias list */
};
dentry_operations
struct dentry_operations {
int (*d_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, const struct inode *,
struct qstr *);
int (*d_compare)(const struct dentry *, const struct inode *,
const struct dentry *, const struct inode *,
unsigned int, const char *, const struct qstr *);
int (*d_delete)(const struct dentry *);
void (*d_release)(struct dentry *);
void (*d_prune)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(struct dentry *, bool);
} ____cacheline_aligned;
The kernel creates a dentry for every component of each pathname a process accesses: for /users/yuyijq/test there are four dentries, for /, users, yuyijq and test. dentry objects live in a slab cache known as the directory entry cache (dcache), which makes repeated lookups of a file fast. One or more dentries can be associated with the same inode, as happens with hard links.
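That "several names, one inode" relationship is easy to demonstrate from user space with link(2) and stat(2) (the file names are only examples, and original.txt must already exist):

/* After link(2), two different path names refer to the same inode. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat a, b;

    if (link("original.txt", "alias.txt") != 0) {
        perror("link");
        return 1;
    }
    stat("original.txt", &a);
    stat("alias.txt", &b);
    printf("same inode? %s (inode %lu)\n",
           a.st_ino == b.st_ino ? "yes" : "no",
           (unsigned long)a.st_ino);
    return 0;
}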
Pathname lookup
When we issue system calls such as open() or mkdir(), the kernel has to find the VFS objects (inode, dentry, and so on) that correspond to the pathname; this process is pathname lookup. As mentioned earlier, if the pathname starts with "/" the lookup begins at current->fs->root (current being the current process), otherwise it begins at current->fs->pwd. From that starting point the kernel walks the path component by component, and to speed the walk up it consults the dentry cache (dcache) along the way.
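Very roughly, the walk looks like the sketch below; the real code lives in fs/namei.c and additionally handles the dcache, RCU walk, mount points, symlinks and permission checks, and the helpers next_component() and lookup_child() are invented names standing in for that machinery:

struct dentry *lookup_sketch(const char *path, struct fs_struct *fs)
{
    struct dentry *dir = (*path == '/') ? fs->root.dentry : fs->pwd.dentry;
    char component[256];

    while (next_component(&path, component)) {   /* e.g. "users", "yuyijq", ... */
        /* first try the dcache; fall back to the file system's
         * dir->d_inode->i_op->lookup() on a miss */
        dir = lookup_child(dir, component);
        if (!dir)
            return NULL;                         /* component does not exist */
    }
    return dir;                                  /* dentry of the final component */
}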
Registering a file system
To register or unregister a file system, use the following functions from <linux/fs.h>:
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
The file system types already registered with the kernel are listed in /proc/filesystems.
file_system_type
struct file_system_type {
const char *name;
int fs_flags;
#define FS_REQUIRES_DEV 1
#define FS_BINARY_MOUNTDATA 2
#define FS_HAS_SUBTYPE 4
#define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
struct hlist_head fs_supers;
struct lock_class_key s_lock_key;
struct lock_class_key s_umount_key;
struct lock_class_key s_vfs_rename_key;
struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];
struct lock_class_key i_lock_key;
struct lock_class_key i_mutex_key;
struct lock_class_key i_mutex_dir_key;
};
When a file system is mounted with mount, the kernel internally calls the mount function of the corresponding file_system_type. That function returns a dentry, which is associated with a super_block; along the way the mount method sets the super_block's s_op pointer to the concrete implementation.
The function struct file_system_type *get_fs_type(const char *name) walks the registered file system types by name and returns the matching file_system_type object.
During boot the kernel mounts the root file system directly; other file systems are mounted onto it later by the init scripts or by commands the user runs.
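Tying registration and mounting together, here is a minimal sketch of how a hypothetical block-device file system module might register itself; the myfs_* names are invented, and a real fill_super callback would read the on-disk superblock and set sb->s_op and sb->s_root rather than just failing:

#include <linux/fs.h>
#include <linux/module.h>

/* Stub: a real fill_super reads the on-disk superblock and sets
 * sb->s_op and sb->s_root. */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
    return -EINVAL;
}

static struct dentry *myfs_mount(struct file_system_type *fs_type,
                                 int flags, const char *dev_name, void *data)
{
    /* mount_bdev() is the usual helper for block-device based file systems */
    return mount_bdev(fs_type, flags, dev_name, data, myfs_fill_super);
}

static struct file_system_type myfs_type = {
    .owner    = THIS_MODULE,
    .name     = "myfs",
    .mount    = myfs_mount,
    .kill_sb  = kill_block_super,
    .fs_flags = FS_REQUIRES_DEV,
};

static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);  /* now listed in /proc/filesystems */
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");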
Device files
Everything described so far is the traditional role of the VFS, serving disk file systems such as Ext2 and Ext3. But as we all know, in Unix/Linux everything is a file (networking being the notable exception). Want to drive a serial port? Open a device file and read from it and write to it just as you would an ordinary file, with open, read, write and friends. How is that possible? Once again, it comes down to the VFS's power of abstraction: the VFS hides the differences between these devices very well.
Device files come in two types:
- Block devices can be addressed randomly, and transferring any given block takes roughly the same amount of time. Typical block devices are hard disks, floppy disks and optical discs.
- Character devices either cannot be addressed randomly at all (a sound card, for instance), or can be, but with an access time that depends on where the data sits on the device (a tape drive, for instance).
What sets a device file apart from an ordinary file is that it merely exists in the file system: its inode has no pointers to data blocks on disk, but it does carry an identifier for the device. This identifier generally consists of the device type (block or character) plus a pair of numbers. The first is the major number, which identifies the kind of device; as a rule, device files of the same type with the same major number share the same file operations (think about how the VFS abstracts this). The second is the minor number, which tells apart the individual devices within a group sharing the same major number. For instance, disks attached to the same disk controller have the same major number but different minor numbers.
The mknod system call can be used to create a device file:
SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, dev)
{
return sys_mknodat(AT_FDCWD, filename, mode, dev);
}
Device files usually live under /dev (heh, for the longest time I thought that name meant "developer"). Note that character devices and block devices have separate major/minor number spaces. A device file also does not have to correspond to a real physical device: /dev/null, for example, is a black hole that simply discards anything written to it. To the kernel, the file's name itself carries no meaning.
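As an illustration, a device node can also be created from user space with mknod(2); the example below uses major 1, minor 3 (the conventional numbers of /dev/null) purely for demonstration, the path /tmp/mynull is made up, and creating device nodes normally requires root privileges:

/* Create a character device node: type, major and minor fully determine
 * which driver the kernel will hand open/read/write to. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

int main(void)
{
    /* S_IFCHR: character device; major 1, minor 3 */
    if (mknod("/tmp/mynull", S_IFCHR | 0666, makedev(1, 3)) != 0) {
        perror("mknod");
        return 1;
    }
    printf("created /tmp/mynull\n");
    return 0;
}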
