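In a Linux file system, an inode represents a file on disk. Every operation on an on-disk file is ultimately carried out against its inode, so managing the inode life cycle matters a great deal. Creating, writing, reading and deleting an inode are usually not standalone flows, though; they are embedded in larger I/O flows such as file creation, and descriptions of those flows tend to mention inode handling only in passing. That makes a dedicated, detailed summary of inode management worthwhile. As before, this article uses the ext2 file system as its example to describe how the inode life cycle is managed.
First, note that the functions that manage inodes are declared in the superblock operations table (struct super_operations), because the inode's own operations table is devoted to file-related operations. Let's take a look: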
// super.c
static const struct super_operations ext2_sops = {
.alloc_inode = ext2_alloc_inode,
.free_inode = ext2_free_in_core_inode,
.write_inode = ext2_write_inode,
.evict_inode = ext2_evict_inode,
.put_super = ext2_put_super,
.sync_fs = ext2_sync_fs,
.freeze_fs = ext2_freeze,
.unfreeze_fs = ext2_unfreeze,
.statfs = ext2_statfs,
.remount_fs = ext2_remount,
.show_options = ext2_show_options,
};
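To make the role of such an operations table concrete, here is a minimal user-space sketch of a generic layer dispatching through function pointers. It is not kernel code: every `demo_*` name is hypothetical, and the fallback to a plain `calloc()` when a hook is missing is only an assumption made for the demo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_sb;

/* Hypothetical in-memory inode, standing in for struct inode. */
struct demo_inode {
    unsigned long ino;
    struct demo_sb *sb;
};

/* Hypothetical operations table, standing in for struct super_operations. */
struct demo_super_ops {
    struct demo_inode *(*alloc_inode)(struct demo_sb *sb);
    void (*free_inode)(struct demo_inode *inode);
};

struct demo_sb {
    const struct demo_super_ops *ops;
    int live_inodes;                 /* bookkeeping for the demo only */
};

/* File-system-specific hooks. */
static struct demo_inode *demo_fs_alloc_inode(struct demo_sb *sb)
{
    struct demo_inode *inode = calloc(1, sizeof(*inode));
    if (inode)
        sb->live_inodes++;
    return inode;
}

static void demo_fs_free_inode(struct demo_inode *inode)
{
    inode->sb->live_inodes--;
    free(inode);
}

static const struct demo_super_ops demo_fs_ops = {
    .alloc_inode = demo_fs_alloc_inode,
    .free_inode  = demo_fs_free_inode,
};

/* Generic layer: use the file system's hook if it provides one. */
struct demo_inode *demo_vfs_new_inode(struct demo_sb *sb)
{
    struct demo_inode *inode;

    assert(sb && sb->ops);
    if (sb->ops->alloc_inode)
        inode = sb->ops->alloc_inode(sb);
    else
        inode = calloc(1, sizeof(*inode));
    if (inode)
        inode->sb = sb;
    return inode;
}

/* Generic layer: hand the inode back through the matching hook. */
void demo_vfs_destroy_inode(struct demo_inode *inode)
{
    assert(inode && inode->sb && inode->sb->ops);
    if (inode->sb->ops->free_inode)
        inode->sb->ops->free_inode(inode);
    else
        free(inode);
}
```

The generic layer never names the file system; it only knows the table, which is why each file system can plug in its own allocation and release logic.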
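I. Before we begin
Before walking through inode management proper, let's first lay out the functions involved and the flows that lead into them, and then go through each scenario in turn, which keeps things clearer.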
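Scenario | Implementing function | Triggering flows | Upstream functions |
Allocate an inode | ext2_alloc_inode() | ① create a file ② create a directory ③ create a symlink ④ create a temporary file ⑤ mknod system call (device, socket) | inode.c alloc_inode() |
Write an inode to disk | ext2_write_inode() | ① create a file ② create a directory ③ create a symlink ④ create a temporary file ⑤ mknod system call (device, socket) ⑥ link and unlink a file ⑦ remove a directory | fs.h mark_inode_dirty() inode_inc_link_count() inode_dec_link_count() |
Read an inode | ext2_iget() | ① mount the file system and read the root inode ② look up a file (lookup) | |
Evict an inode (set i_dtime) | ext2_evict_inode() | ① delete a file (unlink) ② remove a directory (rmdir) | fs.h inode_dec_link_count() inode.c evict() |
Free the in-memory inode | ext2_free_in_core_inode() | ① inode allocation fails ② the inode is deleted (evict) | inode.c evict() |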
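II. Allocating an inode
Allocating an inode means creating the in-memory inode: both the inode object used by the VFS and the file system's own cached inode, which for ext2 is struct ext2_inode_info. Both live in memory, so allocation here is simply memory allocation.
The code below allocates the inode; the call to kmem_cache_alloc() that takes the object from the slab cache is plain to see. You may wonder why no memory is allocated for the VFS inode. Note that in struct ext2_inode_info the vfs_inode field is an embedded structure, not a pointer to one, so a single allocation provides both.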
// super.c
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
ei = kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL);
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
inode_set_iversion(&ei->vfs_inode, 1);
......
return &ei->vfs_inode;
}
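The "one allocation, two objects" idea can be sketched in user space as follows. The `demo_*` types are hypothetical stand-ins for struct inode and struct ext2_inode_info; the pointer arithmetic mirrors what a `container_of`-style helper such as EXT2_I() does, but this is an illustration rather than the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic VFS inode. */
struct demo_vfs_inode {
    unsigned long i_ino;
};

/* Hypothetical stand-in for ext2_inode_info: the generic inode is
 * embedded by value, not referenced through a pointer. */
struct demo_inode_info {
    unsigned int i_flags;
    struct demo_vfs_inode vfs_inode;
};

/* One allocation yields both objects; the caller only sees the inner one. */
struct demo_vfs_inode *demo_alloc_inode(void)
{
    struct demo_inode_info *ei = calloc(1, sizeof(*ei));

    if (!ei)
        return NULL;
    return &ei->vfs_inode;
}

/* Recover the enclosing structure from a pointer to the embedded member. */
struct demo_inode_info *demo_info(struct demo_vfs_inode *inode)
{
    assert(inode != NULL);
    return (struct demo_inode_info *)
        ((char *)inode - offsetof(struct demo_inode_info, vfs_inode));
}

/* Freeing must go through the outer object, since that is what was allocated. */
void demo_free_inode(struct demo_vfs_inode *inode)
{
    free(demo_info(inode));
}
```

Because the inner object sits at a fixed offset inside the outer one, the file system can always get back to its private fields from the pointer the generic layer hands it.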
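1. Callers (pre)
There are two ways an inode gets allocated: it is created directly, or it is looked up by ino and created only when the lookup misses. The main cases of the former are creating a file, a directory, a symlink or a temporary file; the main cases of the latter are looking up the root inode by ino while filling the superblock, and file lookup.
2. Follow-up work (post)
1) When creating an inode
Once the empty inode has been created, a few basic steps usually follow:
- ① generate the ino (by searching the inode bitmap)
- ② fill in the inode used by the VFS
- ③ fill in the cached inode, which for ext2 is ext2_inode_info
- ④ mark the inode dirty so that it is written back to the disk device later
For the first three items see ext2_new_inode() in ialloc.c; for the fourth see ext2_create() in namei.c.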
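2) When looking up an inode by ino
If the inode is not found in the inode cache, a new one is created. The ino is already known in this case, so it is used to fetch the inode's contents from disk:
- ① read the inode from disk by ino
- ② fill in the inode used by the VFS
- ③ fill in the cached inode, which for ext2 is ext2_inode_info
See ext2_iget() in inode.c.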
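The "look in the cache first, read from disk only on a miss" pattern can be modelled with a toy cache. Everything here is hypothetical (`demo_iget`, the fixed-size slot array, the fake disk read); the kernel's inode cache is a hash table with locking and reference counts, none of which this sketch attempts.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHE_SLOTS 8

struct demo_inode {
    unsigned long ino;      /* 0 means the slot is unused */
    unsigned int  mode;
};

struct demo_cache {
    struct demo_inode slots[DEMO_CACHE_SLOTS];
    unsigned int disk_reads;    /* counts simulated disk accesses */
};

/* Hypothetical "disk": derives a regular-file mode from the ino. */
static unsigned int demo_read_mode_from_disk(unsigned long ino)
{
    return 0100000u | (unsigned int)(ino & 0777u);
}

/* Return the cached inode for ino, populating it from "disk" on a miss. */
struct demo_inode *demo_iget(struct demo_cache *c, unsigned long ino)
{
    struct demo_inode *free_slot = NULL;

    assert(c != NULL && ino != 0);

    for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++) {
        if (c->slots[i].ino == ino)
            return &c->slots[i];            /* cache hit: no disk access */
        if (!free_slot && c->slots[i].ino == 0)
            free_slot = &c->slots[i];
    }
    if (!free_slot)
        return NULL;    /* toy cache is full; a real cache would evict */

    /* Cache miss: create the in-memory inode and fill it from disk. */
    free_slot->ino = ino;
    free_slot->mode = demo_read_mode_from_disk(ino);
    c->disk_reads++;
    return free_slot;
}
```

A second call with the same ino returns the same object without touching the disk, which is the behaviour the lookup path relies on.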
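III. Writing an inode to disk
Writing an inode to the block device is a fairly straightforward flow: compute from the ino which block holds the inode, read that block into a buffer (buffer_head), update the raw inode in the buffer from the in-memory inode, and finally mark the buffer dirty so that it is synced to disk.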
// inode.c
static int __ext2_write_inode(struct inode *inode, int do_sync)
{
......
struct ext2_inode * raw_inode = ext2_get_inode(sb, ino, &bh);
int n;
int err = 0;
if (IS_ERR(raw_inode))
return -EIO;
/* For fields not tracked in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT2_STATE_NEW)
memset(raw_inode, 0, EXT2_SB(sb)->s_inode_size);
raw_inode->i_mode = cpu_to_le16(inode->i_mode);
......
raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
raw_inode->i_size = cpu_to_le32(inode->i_size);
raw_inode->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
raw_inode->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
raw_inode->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
raw_inode->i_frag = ei->i_frag_no;
raw_inode->i_fsize = ei->i_frag_size;
raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
......
raw_inode->i_generation = cpu_to_le32(inode->i_generation);
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
if (old_valid_dev(inode->i_rdev)) {
raw_inode->i_block[0] =
cpu_to_le32(old_encode_dev(inode->i_rdev));
raw_inode->i_block[1] = 0;
} else {
raw_inode->i_block[0] = 0;
raw_inode->i_block[1] =
cpu_to_le32(new_encode_dev(inode->i_rdev));
raw_inode->i_block[2] = 0;
}
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
mark_buffer_dirty(bh);
......
ei->i_state &= ~EXT2_STATE_NEW;
brelse (bh);
return err;
}
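1. Callers
A dirty inode is scheduled for writeback to the block device, which is when the write_inode() operation gets called. Which scenarios mark an inode dirty? Anything that creates or modifies an inode must do so, and they fall into these groups:
- ① creating an inode (creating a file, a directory, a symlink or a temporary file, renaming a file)
- ② deleting an inode (deleting a file)
- ③ modifying file data (writing to a file)
- ④ setting file attributes (changing the ACL, changing the owner, and so on)
2. What triggers the write (marking the inode dirty)
Marking an inode dirty essentially puts it on the dirty-inode list of the backing device's writeback structure (b_dirty in struct bdi_writeback), where it waits to be scheduled for writing to disk. In the code below, the key lines are the ones that choose dirty_list and call inode_io_list_move_locked().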
// fs-writeback.c
void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
int dirtytime = 0;
.....
if (((inode->i_state & flags) == flags) ||
(dirtytime && (inode->i_state & I_DIRTY_INODE)))
return;
spin_lock(&inode->i_lock);
if (dirtytime && (inode->i_state & I_DIRTY_INODE))
goto out_unlock_inode;
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
inode_attach_wb(inode, NULL);
/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
if (flags & I_DIRTY_INODE)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
/*
* If the inode is queued for writeback by flush worker, just
* update its dirty state. Once the flush worker is done with
* the inode it will place it on the appropriate superblock
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
goto out_unlock_inode;
/*
* Only add valid (hashed) inodes to the superblock's
* dirty list. Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
goto out_unlock_inode;
}
if (inode->i_state & I_FREEING)
goto out_unlock_inode;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
wb = locked_inode_to_wb_and_lock_list(inode);
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
if (inode->i_state & I_DIRTY)
dirty_list = &wb->b_dirty;
else
dirty_list = &wb->b_dirty_time;
wakeup_bdi = inode_io_list_move_locked(inode, wb,
dirty_list);
spin_unlock(&wb->list_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
* If this is the first dirty inode for this bdi,
* we have to wake-up the corresponding bdi thread
* to make sure background write-back happens
* later.
*/
if (wakeup_bdi &&
(wb->bdi->capabilities & BDI_CAP_WRITEBACK))
wb_wakeup_delayed(wb);
return;
}
}
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
// backing-dev-defs.h
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */
unsigned long state; /* Always use atomic bitops on this */
unsigned long last_old_flush; /* last old data flush */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
struct list_head b_dirty_time; /* time stamps are dirty */
spinlock_t list_lock; /* protects the b_* lists */
atomic_t writeback_inodes; /* number of inodes under writeback */
struct percpu_counter stat[NR_WB_STAT_ITEMS];
......
};
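Two properties of the code above are easy to miss: an inode is queued only on its first transition from clean to dirty, and a later re-dirtying adds state bits without moving it, so the list stays ordered by dirtying time. The sketch below models just that, with hypothetical names and flag values; locking, the separate timestamp list and the writeback wake-up are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for the demo. */
#define DEMO_DIRTY_SYNC     0x1u
#define DEMO_DIRTY_DATASYNC 0x2u
#define DEMO_DIRTY_PAGES    0x4u
#define DEMO_DIRTY_ALL \
    (DEMO_DIRTY_SYNC | DEMO_DIRTY_DATASYNC | DEMO_DIRTY_PAGES)

struct demo_inode {
    unsigned int state;
    unsigned long dirtied_when;
    struct demo_inode *next;    /* link on the dirty list */
};

struct demo_wb {
    struct demo_inode *head;    /* most recently dirtied inode first */
    size_t nr_dirty;
};

/* Returns 1 if the inode was newly queued, 0 if nothing was queued. */
int demo_mark_inode_dirty(struct demo_wb *wb, struct demo_inode *inode,
                          unsigned int flags, unsigned long now)
{
    unsigned int was_dirty;

    assert(wb && inode && flags && (flags & DEMO_DIRTY_ALL) == flags);

    /* Fast path: every requested bit is already set. */
    if ((inode->state & flags) == flags)
        return 0;

    was_dirty = inode->state & DEMO_DIRTY_ALL;
    inode->state |= flags;

    /* Already on the list: keep its position and its timestamp. */
    if (was_dirty)
        return 0;

    /* First dirtying: stamp the time and push onto the list. */
    inode->dirtied_when = now;
    inode->next = wb->head;
    wb->head = inode;
    wb->nr_dirty++;
    return 1;
}
```

Keeping the original timestamp means writeback can still treat the inode as having been dirty since the first modification, however often it is touched afterwards.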
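IV. Reading an inode
An inode has to be read from disk when a file system is mounted and its root inode is loaded, or when an operation such as stat() misses in the cache and a lookup is needed. The read flow is simple: derive the block group from the ino, fetch that group's descriptor, use the inode table address in the descriptor to find the block that holds the inode, and read that block with sb_bread().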
// inode.c
static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
struct buffer_head **p)
{
......
block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb);
gdp = ext2_get_group_desc(sb, block_group, NULL);
if (!gdp)
goto Egdp;
/*
* Figure out the offset within the block group inode table
*/
offset = ((ino - 1) % EXT2_INODES_PER_GROUP(sb)) * EXT2_INODE_SIZE(sb);
block = le32_to_cpu(gdp->bg_inode_table) +
(offset >> EXT2_BLOCK_SIZE_BITS(sb));
if (!(bh = sb_bread(sb, block)))
goto Eio;
*p = bh;
offset &= (EXT2_BLOCK_SIZE(sb) - 1);
return (struct ext2_inode *) (bh->b_data + offset);
......
}
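The arithmetic in ext2_get_inode() can be tried out in isolation. The function below is a user-space sketch: `demo_locate_inode` and its struct are hypothetical, and the file-system geometry that the kernel reads from the superblock and group descriptor is passed in as plain parameters.

```c
#include <assert.h>
#include <stdint.h>

struct demo_inode_loc {
    uint32_t block_group;       /* group that owns the inode */
    uint32_t block;             /* absolute block holding the inode */
    uint32_t offset_in_block;   /* byte offset of the inode in that block */
};

/*
 * inode_table_start is the first block of the owning group's inode
 * table, i.e. the value the group descriptor would supply.
 */
struct demo_inode_loc demo_locate_inode(uint32_t ino,
                                        uint32_t inodes_per_group,
                                        uint32_t inode_size,
                                        uint32_t block_size_bits,
                                        uint32_t inode_table_start)
{
    struct demo_inode_loc loc;
    uint32_t offset;

    assert(ino >= 1 && inodes_per_group > 0);

    /* Inode numbers are 1-based, hence ino - 1 throughout. */
    loc.block_group = (ino - 1) / inodes_per_group;

    /* Byte offset of this inode inside the group's inode table. */
    offset = ((ino - 1) % inodes_per_group) * inode_size;

    loc.block = inode_table_start + (offset >> block_size_bits);
    loc.offset_in_block = offset & ((1u << block_size_bits) - 1);
    return loc;
}
```

As a worked example, take ino 12 with 8192 inodes per group, 128-byte inodes, 1 KiB blocks and an inode table starting at block 5: the index in the group is 11, the byte offset is 11 × 128 = 1408, so the inode sits in block 5 + 1 = 6 at offset 1408 − 1024 = 384.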
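V. Evicting an inode
When a file or directory is deleted and the inode's hard-link count has dropped to zero, the final iput() calls the eviction function evict_inode(), which completes the eviction. An inode that still has links can also be evicted from the cache; in that case want_delete stays 0 and steps ②, ③ and ⑦ below are skipped. Eviction consists of:
- ① truncate the page cache associated with the inode (truncate_inode_pages_final())
- ② update the on-disk inode, setting the deletion time i_dtime to the current time to mark it deleted
- ③ truncate the data blocks associated with the inode (ext2_truncate_blocks())
- ④ clean up the buffers associated with the inode (invalidate_inode_buffers())
- ⑤ clear the inode's state (clear_inode())
- ⑥ discard the block reservation (ext2_discard_reservation())
- ⑦ release the on-disk inode by clearing its bit in the inode bitmap (ext2_free_inode())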
// inode.c
/*
* Called at the last iput() if i_nlink is zero.
*/
void ext2_evict_inode(struct inode * inode)
{
struct ext2_block_alloc_info *rsv;
int want_delete = 0;
if (!inode->i_nlink && !is_bad_inode(inode)) {
want_delete = 1;
dquot_initialize(inode);
} else {
dquot_drop(inode);
}
truncate_inode_pages_final(&inode->i_data);
if (want_delete) {
sb_start_intwrite(inode->i_sb);
/* set dtime */
EXT2_I(inode)->i_dtime = ktime_get_real_seconds();
mark_inode_dirty(inode);
__ext2_write_inode(inode, inode_needs_sync(inode));
/* truncate to 0 */
inode->i_size = 0;
if (inode->i_blocks)
ext2_truncate_blocks(inode, 0);
ext2_xattr_delete_inode(inode);
}
invalidate_inode_buffers(inode);
clear_inode(inode);
ext2_discard_reservation(inode);
rsv = EXT2_I(inode)->i_block_alloc_info;
EXT2_I(inode)->i_block_alloc_info = NULL;
if (unlikely(rsv))
kfree(rsv);
if (want_delete) {
ext2_free_inode(inode);
sb_end_intwrite(inode->i_sb);
}
}
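The function above takes one of two paths depending on want_delete. As a rough model of that control flow only, the sketch below records which steps would run and in what order; the enum, `demo_evict_plan` and its parameters are hypothetical, and quota and extended-attribute handling are left out.

```c
#include <assert.h>
#include <stddef.h>

enum demo_step {
    DEMO_TRUNCATE_PAGES,
    DEMO_WRITE_DTIME,
    DEMO_TRUNCATE_BLOCKS,
    DEMO_INVALIDATE_BUFFERS,
    DEMO_CLEAR_INODE,
    DEMO_DISCARD_RESERVATION,
    DEMO_FREE_ON_DISK_INODE,
};

#define DEMO_MAX_STEPS 7

/* Fill out[] with the steps in execution order; return how many there are. */
size_t demo_evict_plan(unsigned int nlink, int is_bad, unsigned long blocks,
                       enum demo_step out[DEMO_MAX_STEPS])
{
    size_t n = 0;
    /* Delete on disk only when no links remain and the inode is sane. */
    int want_delete = (nlink == 0 && !is_bad);

    assert(out != NULL);

    out[n++] = DEMO_TRUNCATE_PAGES;
    if (want_delete) {
        out[n++] = DEMO_WRITE_DTIME;
        if (blocks)
            out[n++] = DEMO_TRUNCATE_BLOCKS;
    }
    out[n++] = DEMO_INVALIDATE_BUFFERS;
    out[n++] = DEMO_CLEAR_INODE;
    out[n++] = DEMO_DISCARD_RESERVATION;
    if (want_delete)
        out[n++] = DEMO_FREE_ON_DISK_INODE;
    return n;
}
```

With a link count of zero and allocated blocks, all seven steps run; with links remaining, only the four in-memory cleanup steps do.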
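VI. Freeing an inode
When an inode is evicted, the freeing functions are called to finish the cleanup. This mainly covers two things:
- ① freeing the in-memory inode (ext2_free_in_core_inode())
- ② clearing the inode's bit in the on-disk inode bitmap (ext2_free_inode())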
// super.c
static void ext2_free_in_core_inode(struct inode *inode)
{
kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
}
// ialloc.c
void ext2_free_inode (struct inode * inode)
{
......
ino = inode->i_ino;
ext2_debug ("freeing inode %lu\n", ino);
......
es = EXT2_SB(sb)->s_es;
is_directory = S_ISDIR(inode->i_mode);
......
block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb);
bit = (ino - 1) % EXT2_INODES_PER_GROUP(sb);
bitmap_bh = read_inode_bitmap(sb, block_group);
if (!bitmap_bh)
return;
/* Ok, now we can actually update the inode bitmaps.. */
if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
bit, (void *) bitmap_bh->b_data))
ext2_error (sb, "ext2_free_inode",
"bit already cleared for inode %lu", ino);
else
ext2_release_inode(sb, block_group, is_directory);
mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & SB_SYNCHRONOUS)
sync_dirty_buffer(bitmap_bh);
brelse(bitmap_bh);
}
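The bitmap update in ext2_free_inode() is a test-and-clear: the old value of the bit tells the caller whether the inode was really allocated, which is how a double free is detected. A minimal user-space sketch follows, assuming a plain byte array for one group's bitmap and no locking; `demo_free_ino` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Clear the bitmap bit for ino in its group's bitmap.
 * Returns 1 if the bit was set (normal free), 0 if it was already clear
 * (the double-free case that the kernel code reports as an error).
 * If group_out is non-NULL it receives the owning block group.
 */
int demo_free_ino(uint8_t *bitmap, unsigned long inodes_per_group,
                  unsigned long ino, unsigned long *group_out)
{
    unsigned long bit;
    uint8_t mask;
    int was_set;

    assert(bitmap != NULL && ino >= 1 && inodes_per_group > 0);

    bit = (ino - 1) % inodes_per_group;
    mask = (uint8_t)(1u << (bit % 8));
    if (group_out)
        *group_out = (ino - 1) / inodes_per_group;

    was_set = (bitmap[bit / 8] & mask) != 0;
    bitmap[bit / 8] &= (uint8_t)~mask;
    return was_set;
}
```

Only when the old bit was set should the free-inode counters be adjusted, which matches the else branch that calls ext2_release_inode() above.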
References:
File creation: 05 ext2 file system I/O flow: creating a file (create)
Kernel documentation: VFS
Kernel version: 5.16.7