What is GC and how does its flow work?
Garbage Collection, abbreviated GC, is the mechanism for reclaiming space.
In Ceph, deleting files is a common operation. Deleting a file does not reclaim its space synchronously; RGW deletes objects asynchronously through a periodic GC mechanism.
In the normal delete path, removing an object does not free its space immediately. Instead, information about the objects to be deleted is appended to a GC queue, and a GC thread (one per RGW instance) periodically frees the disk space, with a limit on how much is released in each round.
Let's look at the effect first.
```shell
[root@node86 out]# s3cmd put mgr.x.log s3://hrp/
upload: 'mgr.x.log' -> 's3://hrp/mgr.x.log'  [1 of 1]
 65016842 of 65016842   100% in    1s    39.40 MB/s  done
[root@node86 out]# s3cmd rm s3://hrp/mgr.x.log
delete: 's3://hrp/mgr.x.log'
```
After the delete, a GC shard object in the log pool holds an omap entry recording the tail objects to be reclaimed:

```shell
[root@node86 build]# rados listomapvals -p expondata.rgw.log -N gc gc.9 |more
key (48 bytes):
00000000 30 5f 62 34 63 34 66 38 63 31 2d 33 62 35 64 2d  |0_b4c4f8c1-3b5d-|
00000010 34 63 63 37 2d 61 63 36 61 2d 61 30 30 30 37 31  |4cc7-ac6a-a00071|
00000020 66 63 30 32 62 65 2e 38 34 31 31 31 2e 34 37 00  |fc02be.84111.47.|
00000030

value (4086 bytes):
00000000 01 01 f0 0f 00 00 2e 00 00 00 62 34 63 34 66 38  |..........b4c4f8|
00000010 63 31 2d 33 62 35 64 2d 34 63 63 37 2d 61 63 36  |c1-3b5d-4cc7-ac6|
00000020 61 2d 61 30 30 30 37 31 66 63 30 32 62 65 2e 38  |a-a00071fc02be.8|
00000030 34 31 31 31 2e 34 37 00 01 01 b0 0f 00 00 10 00  |4111.47.........|
........
000000e0 2d 33 62 35 64 2d 34 63 63 37 2d 61 63 36 61 2d  |-3b5d-4cc7-ac6a-|
000000f0 61 30 30 30 37 31 66 63 30 32 62 65 2e 31 34 31  |a00071fc02be.141|
00000100 31 33 2e 31 5f 5f 73 68 61 64 6f 77 5f 6d 67 72  |13.1__shadow_mgr|
00000110 2e 78 2e 6c 6f 67 2e 38 32 46 32 35 2d 43 6b 6e  |.x.log.82F25-Ckn|
00000120 78 55 50 67 56 30 49 39 4a 4d 66 4c 71 77 30 35  |xUPgV0I9JMfLqw05|
00000130 34 6e 53 7a 38 75 5f 30 00 00 00 00 02 01 f4 00  |4nSz8u_0........|
00000140 00 00 1a 00 00 00 65 78 70 6f 6e 64 61 74 61 2e  |......expondata.|
00000150 72 67 77 2e 62 75 63 6b 65 74 73 2e 64 61 74 61  |rgw.buckets.data|
```
The GC queue can also be viewed with `radosgw-admin gc list --include-all` (without `--include-all` you only see the entries that are currently due for processing).
```shell
[root@node86 build]# radosgw-admin gc list --include-all
[
    {
        "tag": "b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.84111.47\u0000",
        "time": "2024-03-03 03:19:15.0.059423",
        "objs": [
            {
                "pool": "expondata.rgw.buckets.data",
                "oid": "b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_0",
                "key": "",
                "instance": ""
            },
            .......
            {
                "pool": "expondata.rgw.buckets.data",
                "oid": "b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_15",
                "key": "",
                "instance": ""
            }
        ]
    }
]
```
At this point the underlying data is still there. Let's manually trigger the GC process and see the effect:
radosgw-admin gc process --include-all
```shell
[root@node86 build]# rados -p expondata.rgw.buckets.data ls |grep mgr
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_14
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_6
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_12
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_2
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_4
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_3
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_8
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_9
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_5
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_0
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_1
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_11
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_10
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_15
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_13
b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_7

# manually trigger GC
[root@node86 build]# radosgw-admin gc process --include-all
2024-03-03 01:44:18.454 7fe7a062bb00  0 svc_http_manager.cc:52:do_start:start http managers
2024-03-03 01:44:19.450 7fe7a062bb00  0 svc_http_manager.cc:39:shutdown:shutdown http manager
2024-03-03 01:44:19.466 7fe7a062bb00  0 svc_http_manager.cc:39:shutdown:shutdown http manager

# the data has been removed
[root@node86 build]# date; rados -p expondata.rgw.buckets.data ls |grep mgr
Sun Mar  3 01:44:21 EST 2024
2024-03-03 01:44:21.861 7f6dbaadcd00 -1 ceph_context.cc:385:handle_conf_change:WARNING: all dangerous and experimental features are enabled.
2024-03-03 01:44:21.908 7f6dbaadcd00 -1 ceph_context.cc:385:handle_conf_change:WARNING: all dangerous and experimental features are enabled.
2024-03-03 01:44:21.943 7f6dbaadcd00 -1 ceph_context.cc:385:handle_conf_change:WARNING: all dangerous and experimental features are enabled.

# the GC queue is empty as well
[root@node86 build]# date; radosgw-admin gc list --include-all
Sun Mar  3 01:44:42 EST 2024
2024-03-03 01:44:42.491 7f30a3fdfb00  0 svc_http_manager.cc:52:do_start:start http managers
[]
2024-03-03 01:44:42.724 7f30a3fdfb00  0 svc_http_manager.cc:39:shutdown:shutdown http manager
2024-03-03 01:44:42.739 7f30a3fdfb00  0 svc_http_manager.cc:39:shutdown:shutdown http manager
[root@node86 build]#
```
In which scenarios does the GC flow kick in?
In RGW, GC generally refers to asynchronous disk-space reclamation:
A client deletes an object: the disk space occupied by that object is handed over to the background GC.
A client overwrites an existing object: the space held by the old object's data needs to be released (a quick way to observe this is shown right after this list).
A client performs a (multipart) upload: shadow objects are produced during the upload, and this temporary data is also reclaimed by GC.
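For example, the overwrite case can be observed with a hypothetical session reusing the hrp bucket from the demo above (output omitted):

```shell
# upload an object, then overwrite it under the same key
s3cmd put mgr.x.log s3://hrp/mgr.x.log
s3cmd put mgr.x.log s3://hrp/mgr.x.log
# the tail objects of the replaced version now appear in the GC queue
radosgw-admin gc list --include-all
```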
How is the GC module initialized?
The GC module is initialized when RGW starts. At startup it creates rgw_gc_max_objs objects in the log pool (32 by default, 65521 at most); these gc objects hold the metadata of the data awaiting reclamation (the omap dump shown above).
The objects are named gc.<num>.
Data awaiting reclamation is hashed across the different gc objects to improve performance (the same sharding idea is also used in the bucket list optimization and will be covered in depth in a later post on bucket shards; a minimal sketch of the shard selection follows the example below).
```shell
[root@node86 build]# ceph daemon out/radosgw.8000.asok config show|grep rgw_gc_max_objs
    "rgw_gc_max_objs": "1",
[root@node86 build]# rados ls -p expondata.rgw.log -N gc
gc.0
```
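As a minimal sketch of the shard-selection idea (not the actual RGW code; RGW uses its own hashing helpers, and `gc_shard_for_tag` below is a hypothetical name used only for illustration):

```cpp
#include <functional>
#include <iostream>
#include <string>

// Hypothetical illustration: map a GC tag to one of the gc.<num> shard objects.
// A plain std::hash is used here; the real code has its own hash helpers.
std::string gc_shard_for_tag(const std::string& tag, unsigned max_objs /* rgw_gc_max_objs */) {
  unsigned index = std::hash<std::string>{}(tag) % max_objs;
  return "gc." + std::to_string(index);
}

int main() {
  // tag format as described later in this post: <zone id>.<instance id>.<request counter>
  std::string tag = "b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.84111.47";
  std::cout << gc_shard_for_tag(tag, 32) << std::endl;  // prints one of gc.0 .. gc.31
  return 0;
}
```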
After the gc objects are created, a worker thread is started. The startup sequence is the same as for the other modules and is not repeated here; the interesting part is the GC thread's entry function. The rgw_gc_processor_period option controls how often a GC pass runs (3600 s, i.e. one hour, by default), and the sleep between passes is implemented with a mutex plus a condition variable.
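As a rough sketch of that pattern (simplified, not the actual RGW worker class; names are approximate), the thread runs a pass, then waits on the condition variable so a shutdown can wake it up early:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Simplified sketch of a GC worker entry loop; `period` corresponds to rgw_gc_processor_period.
class GCWorkerSketch {
  std::mutex lock;
  std::condition_variable cond;
  bool down_flag = false;

public:
  void entry(std::chrono::seconds period) {
    std::unique_lock<std::mutex> l(lock);
    while (!down_flag) {
      l.unlock();
      run_gc_pass();            // corresponds to RGWGC::process() in the real code
      l.lock();
      // sleep for one period, but wake up immediately if stop() is called
      cond.wait_for(l, period, [this] { return down_flag; });
    }
  }

  void stop() {
    std::lock_guard<std::mutex> g(lock);
    down_flag = true;
    cond.notify_all();
  }

private:
  void run_gc_pass() { /* placeholder for the actual gc pass */ }
};
```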
What happens inside the GC reclamation function?
RGWGC::process bounds how long a GC pass may run, which can be limited with the rgw_gc_processor_max_time option, and then walks the gc objects one by one. The following mainly analyzes RGWGC::process.
```cpp
int RGWGC::process(bool expired_only)
{
  // rgw_gc_processor_max_time bounds how long this pass may run
  int max_secs = cct->_conf->rgw_gc_processor_max_time;

  // walk all gc shard objects, starting from a random index
  const int start = ceph::util::generate_random_number(0, max_objs - 1);

  RGWGCIOManager io_manager(this, store->ctx(), this);

  for (int i = 0; i < max_objs; i++) {
    int index = (i + start) % max_objs;
    int ret = process(index, max_secs, expired_only, io_manager);
    if (ret < 0)
      return ret;
  }
  if (!going_down()) {
    io_manager.drain();
  }

  return 0;
}
```
How does GC actually delete the data?
Let's look at the overall flow first.
First, the cls interface provided by Ceph is used to fetch the omap entries on a gc object; each value is a list of objects (let's not worry yet about what the tag is).
After the omap is fetched, each value is iterated; a value entry contains:
```json
{
    "pool": "expondata.rgw.buckets.data",
    "oid": "b4c4f8c1-3b5d-4cc7-ac6a-a00071fc02be.14113.1__shadow_mgr.x.log.82F25-CknxUPgV0I9JMfLqw054nSz8u_15",
    "key": "",
    "instance": ""
}
```
Using the object name and pool name from the value, an op is built:
```cpp
int RGWGC::process()
{
  ctx = new IoCtx;
  ret = rgw_init_ioctx(store->get_rados_handle(), obj.pool, *ctx);
  ....
  ctx->locator_set_key(obj.loc);

  ObjectWriteOperation op;
  cls_refcount_put(op, info.tag, true);
  ret = io_manager.schedule_io(ctx, oid, &op, index, info.tag);
}
```
schedule_io
```cpp
int schedule_io(IoCtx *ioctx, const string& oid, ObjectWriteOperation *op,
                int index, const string& tag)
{
  while (ios.size() > max_aio) {
    if (gc->going_down()) {
      return 0;
    }
    handle_next_completion();
  }

  AioCompletion *c = librados::Rados::aio_create_completion(NULL, NULL, NULL);
  int ret = ioctx->aio_operate(oid, c, op);
  if (ret < 0) {
    return ret;
  }
  ios.push_back(IO{IO::TailIO, c, oid, index, tag});

  return 0;
}
```
handle_next_completion
```cpp
void handle_next_completion()
{
  ceph_assert(!ios.empty());
  IO& io = ios.front();
  io.c->wait_for_safe();
  int ret = io.c->get_return_value();
  io.c->release();

  if (ret == -ENOENT) {
    ret = 0;
  }

  if (io.type == IO::IndexIO) {
    if (ret < 0) {
      ldpp_dout(dpp, 0) << "WARNING: gc cleanup of tags on gc shard index=" <<
        io.index << " returned error, ret=" << ret << dendl;
    }
    goto done;
  }

  if (ret < 0) {
    ldpp_dout(dpp, 0) << "WARNING: gc could not remove oid=" << io.oid <<
      ", ret=" << ret << dendl;
    goto done;
  }

  schedule_tag_removal(io.index, io.tag);

done:
  ios.pop_front();
}
```
After the data is deleted, how are those omap entries removed?
While iterating over the values, the omap key (the tag) is added to a vector of maps: vector<map<string, size_t>> tag_io_size;
```cpp
io_manager.add_tag_io_size(index, info.tag, chain.objs.size());

void add_tag_io_size(int index, string tag, size_t size) {
  auto& ts = tag_io_size[index];
  ts.emplace(tag, size);
}
```
What is this used for?
When handle_next_completion processes the callback of a GC op, it finally calls schedule_tag_removal, which updates tag_io_size and tries to flush the tags to be removed:
```cpp
void schedule_tag_removal(int index, string tag)
{
  auto& ts = tag_io_size[index];
  auto ts_it = ts.find(tag);
  if (ts_it != ts.end()) {
    auto& size = ts_it->second;
    --size;
    // only after every tail object of this tag's chain has been removed
    // can the tag itself be scheduled for trimming
    if (size != 0)
      return;
    ts.erase(ts_it);
  }

  auto& rt = remove_tags[index];

  rt.push_back(tag);
  // batch the tag removals, flushing once rgw_gc_max_trim_chunk is reached
  if (rt.size() >= (size_t)cct->_conf->rgw_gc_max_trim_chunk) {
    flush_remove_tags(index, rt);
  }
}
```
flush_remove_tags
```cpp
void flush_remove_tags(int index, vector<string>& rt)
{
  IO index_io;
  index_io.type = IO::IndexIO;
  index_io.index = index;

  auto rt_guard = make_scope_guard(
    [&] {
      rt.clear();
    }
  );

  int ret = gc->remove(index, rt, &index_io.c);
  ios.push_back(index_io);
}
```
How are the entries shown by gc list (the omap on the gc objects) produced?
Take deleting an object as an example:
```cpp
int RGWRados::Object::Delete::delete_obj()
{
  ....
  int ret = target->complete_atomic_modification();
}
```
complete_atomic_modification
```cpp
int RGWRados::Object::complete_atomic_modification()
{
  if (!state->has_manifest || state->keep_tail)
    return 0;

  cls_rgw_obj_chain chain;
  store->update_gc_chain(obj, state->manifest, &chain);

  if (chain.empty()) {
    return 0;
  }

  string tag = (state->tail_tag.length() > 0 ? state->tail_tag.to_str() : state->obj_tag.to_str());
  return store->gc->send_chain(chain, tag, false);
}
```
Where does this tag come from? Tracing it from the beginning, the tag turns out to be the request id (s->req_id); using it as the omap key on the gc object is presumably meant to guarantee uniqueness.
req->id
```cpp
int process_request
  ---> s->req_id = store->svc.zone_utils->unique_id(req->id);

// e.g. 2d94c7b3-a7d2-4365-b325-6e617d2f267e.4236.3
// i.e. <zone id>.<instance id>.<cumulative request count>
```
When the object-processing class is constructed, unique_tag = s->req_id:
```cpp
AtomicObjectProcessor(Aio *aio, RGWRados *store,
                      const RGWBucketInfo& bucket_info,
                      const rgw_placement_rule *ptail_placement_rule,
                      const rgw_user& owner, RGWObjectCtx& obj_ctx,
                      const rgw_obj& head_obj,
                      std::optional<uint64_t> olh_epoch,
                      const std::string& unique_tag)

processor.emplace<AtomicObjectProcessor>(
    &aio, store, s->bucket_info, pdest_placement,
    s->bucket_owner.get_id(), obj_ctx, obj,
    olh_epoch, s->req_id);
```
After the shadow objects are written, when the head object's metadata is built, obj_op.meta.ptag is set to unique_tag:
```cpp
int AtomicObjectProcessor::complete(...)
{
  obj_op.meta.ptag = &unique_tag;
}
```
state->write_tag
```cpp
RGWRados::Object::prepare_atomic_modification(...)
{
  state->write_tag = *ptag;
  bufferlist bl;
  bl.append(state->write_tag.c_str(), state->write_tag.size() + 1);
  ...
  op.setxattr(RGW_ATTR_TAIL_TAG, bl);
}
```
So tail_tag = s->req_id.
The overall GC flow is not very complicated and the idea is clear, but there are many details, and the code has plenty worth learning from. For example, omap entries are not removed with one IO per entry; they are accumulated and flushed in batches to reduce the performance cost. There is also the sharding idea, using multiple shards to increase concurrency. In addition, many cls operations are involved; they are not expanded on here (a separate chapter will cover them).
As for GC tuning, if the workload is sensitive to storage space, you can adjust when GC is triggered and how long it runs, but these adjustments should be made based on the actual business load.
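For reference, a hedged ceph.conf sketch of the knobs involved (the option names are the standard RGW GC settings; the values are purely illustrative, not recommendations):

```ini
# [client.rgw.<name>] section -- illustrative values only
rgw_gc_obj_min_wait = 3600        # how long a deleted object's tail data waits before it may be reclaimed (default 2 hours)
rgw_gc_processor_period = 1800    # how often a GC pass is started
rgw_gc_processor_max_time = 1800  # upper bound on how long a single GC pass may run
rgw_gc_max_objs = 32              # number of gc.<num> shard objects (best decided before the cluster takes traffic)
rgw_gc_max_trim_chunk = 64        # how many tags are trimmed from a shard's omap in one batch
```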
References:
https://www.yisu.com/zixun/554091.html
https://bean-li.github.io/multisite-put-obj/