10 ext2 File System I/O Flow: inode Creation, Write, Read, Deletion, and Free Subflows

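In a Linux file system, an inode represents a file on disk, and every operation on an on-disk file is ultimately carried out on its inode, so managing the inode life cycle matters a great deal. Creating, writing, reading, and deleting an inode are usually not standalone flows, however; they are embedded in larger I/O flows such as file creation. Descriptions of those larger flows tend to mention inode handling only in passing, so inode management deserves a detailed write-up of its own. As before, this article uses the ext2 file system to walk through the inode life cycle.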

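First, note that the functions that manage inodes are declared in the superblock operations table (struct super_operations), because the inode's own operations table deals with file-level operations. Let's take a look: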

// super.c

static const struct super_operations ext2_sops = {
    .alloc_inode    = ext2_alloc_inode,
    .free_inode     = ext2_free_in_core_inode,
    .write_inode    = ext2_write_inode,
    .evict_inode    = ext2_evict_inode,
    .put_super      = ext2_put_super,
    .sync_fs        = ext2_sync_fs,
    .freeze_fs      = ext2_freeze,
    .unfreeze_fs    = ext2_unfreeze,
    .statfs         = ext2_statfs,
    .remount_fs     = ext2_remount,
    .show_options   = ext2_show_options,
};

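I. Before We Begin

Before getting into inode management proper, let's lay out the relevant functions and the flows that lead into them, and then go through each scenario in turn. That makes the picture clearer.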

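| Scenario | Implementing function | Triggering flows | Preceding functions |
|---|---|---|---|
| Allocate inode | ext2_alloc_inode() | ① create a file; ② create a directory; ③ create a symlink; ④ create a temporary file; ⑤ mknod system call (device, socket) | inode.c: alloc_inode() |
| Write inode to disk | ext2_write_inode() | ① create a file; ② create a directory; ③ create a symlink; ④ create a temporary file; ⑤ mknod system call (device, socket); ⑥ link and unlink a file; ⑦ remove a directory | fs.h: mark_inode_dirty(), inode_inc_link_count(), inode_dec_link_count() |
| Read inode | ext2_iget() | ① mount the file system and read the root inode; ② look up a file (lookup) | |
| Evict inode (set i_dtime) | ext2_evict_inode() | ① delete a file (unlink); ② remove a directory (rmdir) | fs.h: inode_dec_link_count(); inode.c: evict() |
| Free in-memory inode | ext2_free_in_core_inode() | ① inode allocation fails; ② inode deletion (evict) | inode.c: evict() |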

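II. Allocating an inode

Allocating an inode means creating one. This covers both the inode object used by the VFS and the inode the file system caches for its own use, which for ext2 is struct ext2_inode_info. Both live in memory, so allocating an inode amounts to allocating memory.

The code below allocates the inode; you can see it call kmem_cache_alloc() to obtain an object from the slab cache. You may wonder why no memory is allocated for the VFS inode. Note that in struct ext2_inode_info the vfs_inode field is an embedded structure, not a pointer to one, so a single allocation provides both.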

static struct inode *ext2_alloc_inode(struct super_block *sb)
{
    struct ext2_inode_info *ei;
    ei = kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL);
    if (!ei)
        return NULL;
    ei->i_block_alloc_info = NULL;
    inode_set_iversion(&ei->vfs_inode, 1);
    ......
    return &ei->vfs_inode;
}
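
Because vfs_inode is embedded by value, the file system can get back from a VFS inode pointer to its private structure with plain pointer arithmetic; in the kernel, EXT2_I() does this through container_of(). The userspace sketch below illustrates the idea only. The names toy_inode, toy_inode_info, toy_container_of, and the toy_* functions are hypothetical stand-ins, not kernel definitions, and calloc() stands in for the slab allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the VFS inode. */
struct toy_inode {
    unsigned long i_ino;
};

/* Hypothetical stand-in for ext2_inode_info: the VFS inode is
 * embedded by value, not referenced through a pointer. */
struct toy_inode_info {
    unsigned int i_flags;
    struct toy_inode vfs_inode;
};

/* Userspace version of the container_of() idea: step back from the
 * address of a member to the address of the enclosing structure. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* One allocation yields both objects; the caller only sees the
 * embedded VFS inode. */
struct toy_inode *toy_alloc_inode(void)
{
    struct toy_inode_info *ei = calloc(1, sizeof(*ei));

    if (!ei)
        return NULL;
    return &ei->vfs_inode;
}

/* Recover the file-system-private structure from the VFS inode. */
struct toy_inode_info *toy_i(struct toy_inode *inode)
{
    assert(inode != NULL);
    return toy_container_of(inode, struct toy_inode_info, vfs_inode);
}

/* Free the single allocation through the outer structure. */
void toy_free_inode(struct toy_inode *inode)
{
    free(toy_i(inode));
}
```

The same relationship explains why the free path passes EXT2_I(inode), not inode, to kmem_cache_free(): the object that was allocated is the outer structure.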

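1. Preceding call paths (pre)

An inode is allocated in one of two ways: it is created directly, or it is looked up by ino and created only if the lookup finds nothing. The first case covers creating a file, a directory, a symlink, or a temporary file. The second covers looking up the root inode by ino while filling in the superblock, and file lookup.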

[Figure: inode allocation in ext2]

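2. Follow-up processing (post)

1) When the inode is created directly

After the empty inode has been created, the usual follow-up steps are:

  • ① Generate an ino (by searching the inode bitmap)
  • ② Fill in the inode used by the VFS
  • ③ Fill in the file system's cached inode, which for ext2 is ext2_inode_info
  • ④ Mark the inode dirty so that it is written back to the disk later

For the first three steps, see ext2_new_inode() in ialloc.c; for the fourth, see ext2_create() in namei.c.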

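2) When the inode is looked up by ino

If the inode is not found in the inode cache, a new in-memory inode is created. The ino is already known at this point, so the inode's contents are fetched from disk by ino:

  • ① Read the inode from disk by ino
  • ② Fill in the inode used by the VFS
  • ③ Fill in the file system's cached inode, which for ext2 is ext2_inode_info

See ext2_iget() in inode.c.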

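III. Writing an inode to Disk

Writing an inode to the block device is straightforward: compute the block that holds the inode from its ino, map that disk block into the buffer cache (a buffer_head), copy the fields of the in-memory inode into the buffer, and mark the buffer dirty so that it is synced to disk later.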

// inode.c

static int __ext2_write_inode(struct inode *inode, int do_sync)
{
     ......
     struct ext2_inode * raw_inode = ext2_get_inode(sb, ino, &bh);
     int n;
     int err = 0;

     if (IS_ERR(raw_inode))
           return -EIO;

     /* For fields not not tracking in the in-memory inode,
      * initialise them to zero for new inodes. */
     if (ei->i_state & EXT2_STATE_NEW)
          memset(raw_inode, 0, EXT2_SB(sb)->s_inode_size);

     raw_inode->i_mode = cpu_to_le16(inode->i_mode);
     ......
     raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
     raw_inode->i_size = cpu_to_le32(inode->i_size);
     raw_inode->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
     raw_inode->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
     raw_inode->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);

     raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
     raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
     raw_inode->i_flags = cpu_to_le32(ei->i_flags);
     raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
     raw_inode->i_frag = ei->i_frag_no;
     raw_inode->i_fsize = ei->i_frag_size;
     raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
     ......
     raw_inode->i_generation = cpu_to_le32(inode->i_generation);
     if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
          if (old_valid_dev(inode->i_rdev)) {
               raw_inode->i_block[0] =
                    cpu_to_le32(old_encode_dev(inode->i_rdev));
               raw_inode->i_block[1] = 0;
          } else {
               raw_inode->i_block[0] = 0;
               raw_inode->i_block[1] =
                    cpu_to_le32(new_encode_dev(inode->i_rdev));
               raw_inode->i_block[2] = 0;
          }
     } else for (n = 0; n < EXT2_N_BLOCKS; n++)
          raw_inode->i_block[n] = ei->i_data[n];
     mark_buffer_dirty(bh);
     ......
     ei->i_state &= ~EXT2_STATE_NEW;
     brelse (bh);
     return err;
}
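
The branch for character and block devices stores the device number in i_block[] in one of two encodings. The userspace sketch below restates that choice so the bit layout is easier to see. It is modeled on my reading of the kernel helpers old_valid_dev(), old_encode_dev(), and new_encode_dev() in kdev_t.h; the toy_* names are hypothetical, major and minor are passed separately instead of as a dev_t, and the little-endian conversion done by cpu_to_le32() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The old 16-bit format only fits 8-bit major and minor numbers. */
bool toy_old_valid_dev(unsigned major, unsigned minor)
{
    return major < 256 && minor < 256;
}

/* Old format: major in the high byte, minor in the low byte. */
uint16_t toy_old_encode_dev(unsigned major, unsigned minor)
{
    assert(toy_old_valid_dev(major, minor));
    return (uint16_t)((major << 8) | minor);
}

/* New 32-bit format: low 8 bits of the minor, then the major,
 * then the remaining minor bits shifted above the major. */
uint32_t toy_new_encode_dev(unsigned major, unsigned minor)
{
    return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Mirror of the branch in __ext2_write_inode(): small device numbers
 * go into slot 0, large ones into slot 1 with slot 0 zeroed. */
void toy_store_rdev(uint32_t i_block[3], unsigned major, unsigned minor)
{
    if (toy_old_valid_dev(major, minor)) {
        i_block[0] = toy_old_encode_dev(major, minor);
        i_block[1] = 0;
    } else {
        i_block[0] = 0;
        i_block[1] = toy_new_encode_dev(major, minor);
        i_block[2] = 0;
    }
}
```

A zero in slot 0 therefore tells a reader of the on-disk inode that the device number is in the new format in slot 1.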

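1) Preceding call paths

Once an inode is marked dirty, it is scheduled for writeback to the block device, which is when write_inode() is called. Which scenarios mark an inode dirty? Anything that creates or modifies an inode must mark it dirty. The cases are:

  • ① Creating an inode (creating a file, a directory, a symlink, or a temporary file; renaming a file)
  • ② Deleting an inode (deleting a file)
  • ③ Modifying file data (writing to a file)
  • ④ Setting file attributes (changing the ACL, the owner, and so on)

2) What triggers the inode write (marking the inode dirty)

Marking an inode dirty essentially puts it on a dirty-inode list of the backing device's writeback structure (struct bdi_writeback), where it waits to be scheduled for writing to disk. The key lines in the code below are the ones that choose dirty_list and call inode_io_list_move_locked().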

// fs-writeback.c

void __mark_inode_dirty(struct inode *inode, int flags)
{
     struct super_block *sb = inode->i_sb;
     int dirtytime = 0;

     .....

     if (((inode->i_state & flags) == flags) ||
         (dirtytime && (inode->i_state & I_DIRTY_INODE)))
          return;

     spin_lock(&inode->i_lock);
     if (dirtytime && (inode->i_state & I_DIRTY_INODE))
          goto out_unlock_inode;
     if ((inode->i_state & flags) != flags) {
          const int was_dirty = inode->i_state & I_DIRTY;

          inode_attach_wb(inode, NULL);

          /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
          if (flags & I_DIRTY_INODE)
               inode->i_state &= ~I_DIRTY_TIME;
          inode->i_state |= flags;

          /*
           * If the inode is queued for writeback by flush worker, just
           * update its dirty state. Once the flush worker is done with
           * the inode it will place it on the appropriate superblock
           * list, based upon its state.
           */
          if (inode->i_state & I_SYNC_QUEUED)
               goto out_unlock_inode;

          /*
           * Only add valid (hashed) inodes to the superblock's
           * dirty list.  Add blockdev inodes as well.
           */
          if (!S_ISBLK(inode->i_mode)) {
               if (inode_unhashed(inode))
                    goto out_unlock_inode;
          }
          if (inode->i_state & I_FREEING)
               goto out_unlock_inode;

          /*
           * If the inode was already on b_dirty/b_io/b_more_io, don't
           * reposition it (that would break b_dirty time-ordering).
           */
          if (!was_dirty) {
               struct bdi_writeback *wb;
               struct list_head *dirty_list;
               bool wakeup_bdi = false;

               wb = locked_inode_to_wb_and_lock_list(inode);

               inode->dirtied_when = jiffies;
               if (dirtytime)
                    inode->dirtied_time_when = jiffies;

               if (inode->i_state & I_DIRTY)
                    dirty_list = &wb->b_dirty;
               else
                    dirty_list = &wb->b_dirty_time;

               wakeup_bdi = inode_io_list_move_locked(inode, wb,
                                          dirty_list);

               spin_unlock(&wb->list_lock);
               trace_writeback_dirty_inode_enqueue(inode);

               /*
                * If this is the first dirty inode for this bdi,
                * we have to wake-up the corresponding bdi thread
                * to make sure background write-back happens
                * later.
                */
               if (wakeup_bdi &&
                   (wb->bdi->capabilities & BDI_CAP_WRITEBACK))
                    wb_wakeup_delayed(wb);
               return;
          }
     }
out_unlock_inode:
     spin_unlock(&inode->i_lock);
}

// backing-dev-defs.h

struct bdi_writeback {
	struct backing_dev_info *bdi;	/* our parent bdi */

	unsigned long state;		/* Always use atomic bitops on this */
	unsigned long last_old_flush;	/* last old data flush */

	struct list_head b_dirty;	/* dirty inodes */
	struct list_head b_io;		/* parked for writeback */
	struct list_head b_more_io;	/* parked for more writeback */
	struct list_head b_dirty_time;	/* time stamps are dirty */
	spinlock_t list_lock;		/* protects the b_* lists */

	atomic_t writeback_inodes;	/* number of inodes under writeback */
	struct percpu_counter stat[NR_WB_STAT_ITEMS];
    ......
};
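
The essential behavior of __mark_inode_dirty() can be reduced to three rules: do nothing if the requested flags are already set, queue the inode only on its first transition from clean to dirty, and report a wakeup only when the list was empty before. The sketch below models just those rules in userspace. Every name and flag value in it is hypothetical, and locking, I_DIRTY_TIME handling, hashing checks, and the separate b_dirty_time list are all omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DIRTY_INODE 0x1u   /* hypothetical flag values */
#define TOY_DIRTY_PAGES 0x2u

struct toy_wb_inode {
    unsigned state;               /* dirty flags */
    struct toy_wb_inode *next;    /* link in the dirty list */
};

/* Hypothetical stand-in for the b_dirty list in bdi_writeback. */
struct toy_writeback {
    struct toy_wb_inode *head, *tail;
    size_t count;
};

/* Returns true when the caller should wake the writeback worker,
 * i.e. when this inode is the first one on the dirty list. */
bool toy_mark_inode_dirty(struct toy_writeback *wb,
                          struct toy_wb_inode *inode, unsigned flags)
{
    bool was_dirty, was_empty;

    assert(wb != NULL && inode != NULL);

    /* Rule 1: nothing to do if all requested flags are already set. */
    if ((inode->state & flags) == flags)
        return false;

    was_dirty = inode->state != 0;
    inode->state |= flags;

    /* Rule 2: an already-dirty inode keeps its list position, which
     * preserves the time ordering of the list. */
    if (was_dirty)
        return false;

    /* Rule 3: append to the tail and report whether it was empty. */
    was_empty = (wb->head == NULL);
    inode->next = NULL;
    if (wb->tail)
        wb->tail->next = inode;
    else
        wb->head = inode;
    wb->tail = inode;
    wb->count++;
    return was_empty;
}
```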

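IV. Reading an inode

An inode has to be read from disk when the file system is mounted and the root inode is loaded, or when an operation such as stat() misses in the cache and a lookup is required. The read path is fairly simple: use the ino to find the block group, use the block group to find its group descriptor, use the inode table address in the descriptor to find the block that holds the inode, and read that block with sb_bread().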

// inode.c

static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
                         struct buffer_head **p)
{
     ......

     block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb);
     gdp = ext2_get_group_desc(sb, block_group, NULL);
     if (!gdp)
          goto Egdp;
     /*
      * Figure out the offset within the block group inode table
      */
     offset = ((ino - 1) % EXT2_INODES_PER_GROUP(sb)) * EXT2_INODE_SIZE(sb);
     block = le32_to_cpu(gdp->bg_inode_table) +
          (offset >> EXT2_BLOCK_SIZE_BITS(sb));
     if (!(bh = sb_bread(sb, block)))
          goto Eio;

     *p = bh;
     offset &= (EXT2_BLOCK_SIZE(sb) - 1);
     return (struct ext2_inode *) (bh->b_data + offset);
     ......
}
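
The arithmetic in ext2_get_inode() is easier to follow with concrete numbers. The sketch below repeats the same calculation in userspace, using division and modulo in place of the kernel's shifts and masks. The struct and function names are hypothetical, and inode_table_start stands for the bg_inode_table value of the group that the ino falls into, which the real code reads from the group descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* Where an inode lives on disk (hypothetical helper type). */
struct toy_inode_loc {
    uint32_t block_group;       /* group that owns the inode */
    uint32_t block;             /* disk block holding the inode */
    uint32_t offset_in_block;   /* byte offset inside that block */
};

struct toy_inode_loc toy_locate_inode(uint32_t ino,
                                      uint32_t inodes_per_group,
                                      uint32_t inode_size,
                                      uint32_t block_size,
                                      uint32_t inode_table_start)
{
    struct toy_inode_loc loc;
    uint32_t offset;

    assert(ino >= 1 && inodes_per_group > 0 && block_size > 0);

    /* Inode numbers start at 1, hence the (ino - 1). */
    loc.block_group = (ino - 1) / inodes_per_group;

    /* Byte offset of the inode within the group's inode table. */
    offset = ((ino - 1) % inodes_per_group) * inode_size;

    loc.block = inode_table_start + offset / block_size;
    loc.offset_in_block = offset % block_size;
    return loc;
}
```

With assumed values of 8192 inodes per group, 128-byte inodes, 1024-byte blocks, and an inode table starting at block 5, inode 12 has index 11 in group 0 and table offset 11 * 128 = 1408, so it lies in block 5 + 1 = 6 at byte offset 384.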

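V. Evicting an inode

When a directory or file is removed and the inode's hard-link count drops to 0, the last iput() calls the eviction function evict_inode() to evict the inode. Eviction consists of the steps below; steps ②, ③, and ⑦ run only in the deletion case (want_delete in the code):

  • ① Truncate the page cache associated with the inode (truncate_inode_pages_final())
  • ② Update the on-disk inode, setting the deletion time i_dtime to the current time to mark it as deleted
  • ③ Truncate the data blocks associated with the inode (ext2_truncate_blocks())
  • ④ Clean up the buffers associated with the inode (invalidate_inode_buffers())
  • ⑤ Clear the inode's state (clear_inode())
  • ⑥ Discard the reserved blocks (ext2_discard_reservation())
  • ⑦ Release the on-disk inode by clearing its bit in the inode bitmap (ext2_free_inode())

// inode.c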

/*
 * Called at the last iput() if i_nlink is zero.
 */
void ext2_evict_inode(struct inode * inode)
{
     struct ext2_block_alloc_info *rsv;
     int want_delete = 0;

     if (!inode->i_nlink && !is_bad_inode(inode)) {
          want_delete = 1;
          dquot_initialize(inode);
     } else {
          dquot_drop(inode);
     }

     truncate_inode_pages_final(&inode->i_data);

     if (want_delete) {
          sb_start_intwrite(inode->i_sb);
          /* set dtime */
          EXT2_I(inode)->i_dtime     = ktime_get_real_seconds();
          mark_inode_dirty(inode);
          __ext2_write_inode(inode, inode_needs_sync(inode));
          /* truncate to 0 */
          inode->i_size = 0;
          if (inode->i_blocks)
               ext2_truncate_blocks(inode, 0);
          ext2_xattr_delete_inode(inode);
     }

     invalidate_inode_buffers(inode);
     clear_inode(inode);

     ext2_discard_reservation(inode);
     rsv = EXT2_I(inode)->i_block_alloc_info;
     EXT2_I(inode)->i_block_alloc_info = NULL;
     if (unlikely(rsv))
          kfree(rsv);

     if (want_delete) {
          ext2_free_inode(inode);
          sb_end_intwrite(inode->i_sb);
     }
}

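VI. Freeing an inode

Evicting an inode ends by freeing it, which completes the cleanup. This involves two things:

  • ① Free the in-memory inode (ext2_free_in_core_inode())
  • ② Clear the inode's bit in the on-disk inode bitmap (ext2_free_inode())

// super.c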

static void ext2_free_in_core_inode(struct inode *inode)
{
	kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
}

// ialloc.c

void ext2_free_inode (struct inode * inode)
{
     ......

     ino = inode->i_ino;
     ext2_debug ("freeing inode %lu\n", ino);

     ......

     es = EXT2_SB(sb)->s_es;
     is_directory = S_ISDIR(inode->i_mode);

     ......
     
     block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb);
     bit = (ino - 1) % EXT2_INODES_PER_GROUP(sb);
     bitmap_bh = read_inode_bitmap(sb, block_group);
     if (!bitmap_bh)
          return;

     /* Ok, now we can actually update the inode bitmaps.. */
     if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
                    bit, (void *) bitmap_bh->b_data))
          ext2_error (sb, "ext2_free_inode",
                     "bit already cleared for inode %lu", ino);
     else
          ext2_release_inode(sb, block_group, is_directory);
     mark_buffer_dirty(bitmap_bh);
     if (sb->s_flags & SB_SYNCHRONOUS)
          sync_dirty_buffer(bitmap_bh);

     brelse(bitmap_bh);
}

References:

Creating a file: 05 ext2 File System I/O Flow: Creating a File (create)

Kernel documentation: VFS

Kernel version: 5.16.7
