这一节主要分析下util目录下的代码。这部分代码相对独立,功能明确,很容易看明白。可先尝试理解这部分代码,以为进一步理解leveldb代码打下基础。
1 内存管理(arena.cc/.h)
leveldb的大部分内存管理依赖于C++语言的默认实现,也就是不对内存进行管理。只是在memtable的实现中用到了一个简单的内存管理器(arena)。因为memtable的内部实现skip list写入时,需要分配新节点,大量节点的直接分配可能会带来较多的碎片,影响运行效率。因此,leveldb在每个memtable中都会绑定一个arena,在memtable进行minor compact后,memtable销毁时进行统一释放。
下图是一个arena某个运行时刻的截图。从图中可以看出,arena内部使用的基本块大小为4K,已分配的块的指保存在一个vector中。具体分配策略如下。若分配的内存可从剩余的块中分配,则直接从剩余块中分配并调整剩余块指针的位置。否则检查要分配的size是否大于1/4的block(也就是1K),若大于则直接new所需大小的内存,将指针保存在vector中并返回内存指针。从这里可以看出,vector保存的内存块的指针大小可能>4K,也就是对大内存块直接分配。前面两个条件若都不满足,则需要new一个基本块(=4K),将new出来的新块作为当前块,从这块当中分配内存并调整当前内存指针位置。原来当前块的剩余内存就被浪费掉了。
实际中可考虑用tcmalloc/jemalloc以提升内存管理效率。
// alloc_ptr_指向基本块空闲内存位置,alloc_bytes_remaining_为剩余内存大小 // blocks_保存了所有的块,包括了基本块与直接分配的大块 // Allocation state char* alloc_ptr_; size_t alloc_bytes_remaining_; // Array of new[] allocated memory blocks std::vector<char*> blocks_; // Bytes of memory in blocks allocated so far size_t blocks_memory_;
2 过滤器bloom filter的实现(bloom.cc)
leveldb通过接口NewBloomFilterPolicy提供了一个filter的实现,使用时通过传入该filter可提升对不存在的数查找时的效率。open db时的options.filter_policy参数指定该filter。如果指定了filter,leveldb会在生成新sst文件时对该文件的所有key构造一个合适的bloom filter字符串并保存在sst文件中。现在我们来看看这个bloom filter的实现机制。bloom filter需要指定bits_per_key参数,表示每个key需要检测多少个bit位/保存多少个bit位。创建bloom filter的核心代码如下:
virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const { // Compute bloom filter size (in both bits and bytes) size_t bits = n * bits_per_key_; // For small n, we can see a very high false positive rate. Fix it // by enforcing a minimum bloom filter length. if (bits < 64) bits = 64; size_t bytes = (bits + 7) / 8; bits = bytes * 8; const size_t init_size = dst->size(); dst->resize(init_size + bytes, 0); // 将探测需要的尾数放至最后 dst->push_back(static_cast<char>(k_)); // Remember # of probes in filter char* array = &(*dst)[init_size]; for (size_t i = 0; i < n; i++) { // Use double-hashing to generate a sequence of hash values. // See analysis in [Kirsch,Mitzenmacher 2006]. uint32_t h = BloomHash(keys[i]); const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits for (size_t j = 0; j < k_; j++) { const uint32_t bitpos = h % bits; array[bitpos/8] |= (1 << (bitpos % 8)); h += delta; } }
创建时,先预留bloom filter所需空间(n * bits_per_key_/8对齐后),并将需要检测的位数保存至字符串末尾,以方便使用时解出。然后遍历key,计算其hash值,将hash值所对应的bit位置1(需要设置bits_per_key_次,并且后续的bit位从原始的hash值右移17位得到)。过滤函数KeyMayMatch基本就是上述过程的逆过程。bloom filter字符串的内部结构可如下图所示。
3 默认的cache机制—LRU实现(cache.cc/.h)
leveldb的cache主要是为了加快get操作的性能。cache分为block cache与table cache。table cache针对sst文件,key是file_number,cache的大小为options.max_open_files – 10,在db初始化时创建,且cache机制使用默认的LRU算法,外部没法更改。
// Reserve ten files or so for other uses and give the rest to TableCache. const int table_cache_size = options.max_open_files - 10; table_cache_ = new TableCache(dbname_, &options_, table_cache_size);
而block_cache则可以在open时通过options.block_cache参数指定cache实例。外部可自定义合适的cache机制替换系统自带的LRU cache。如果外部没有提供block_cache参数,则leveldb会在初始化时构造8MB大小的cache,该cache同样是LRU cache的实例。该cache的key是block所属table文件的一个cache_id + block在文件中的偏移。
// 没有提供block_cache,则内部使用8MB的LRUCache if (result.block_cache == NULL) { result.block_cache = NewLRUCache(8 << 20); }
char cache_key_buffer[16]; EncodeFixed64(cache_key_buffer, table->rep_->cache_id); EncodeFixed64(cache_key_buffer+8, handle.offset()); Slice key(cache_key_buffer, sizeof(cache_key_buffer)); cache_handle = block_cache->Lookup(key);
如果用户想自定义block_cache算法,则可参考cache.h中Cache类所定义的抽象接口,主要接口代码如下所示。可以看出,也就是key的增删查改以及句柄的释放。
// When the inserted entry is no longer needed, the key and // value will be passed to "deleter". virtual Handle* Insert(const Slice& key, void* value, size_t charge, void (*deleter)(const Slice& key, void* value)) = 0; // If the cache has no mapping for "key", returns NULL. // // Else return a handle that corresponds to the mapping. The caller // must call this->Release(handle) when the returned mapping is no // longer needed. virtual Handle* Lookup(const Slice& key) = 0; // Release a mapping returned by a previous Lookup(). // REQUIRES: handle must not have been released yet. // REQUIRES: handle must have been returned by a method on *this. virtual void Release(Handle* handle) = 0; // Return the value encapsulated in a handle returned by a // successful Lookup(). // REQUIRES: handle must not have been released yet. // REQUIRES: handle must have been returned by a method on *this. virtual void* Value(Handle* handle) = 0; // If the cache contains entry for key, erase it. Note that the // underlying entry will be kept around until all existing handles // to it have been released. virtual void Erase(const Slice& key) = 0; // Return a new numeric id. May be used by multiple clients who are // sharing the same cache to partition the key space. Typically the // client will allocate a new id at startup and prepend the id to // its cache keys. virtual uint64_t NewId() = 0;
NewLRUCache函数可调用自带的LRU实现。现在我们来看看LRU的具体实现。下图是LRUCache的内部实现图与相关代码。
struct LRUHandle { void* value; void (*deleter)(const Slice&, void* value); LRUHandle* next_hash; // hash链表指针 LRUHandle* next; // 循环链表指针 LRUHandle* prev; size_t charge; // TODO(opt): Only allow uint32_t? size_t key_length; uint32_t refs; uint32_t hash; // Hash of key(); used for fast sharding and comparisons char key_data[1]; // Beginning of key -- }; class HandleTable { --- uint32_t length_; uint32_t elems_; LRUHandle** list_; --- }; class LRUCache { --- // Initialized before use. size_t capacity_; --- // Dummy head of LRU list. // lru.prev is newest entry, lru.next is oldest entry. LRUHandle lru_; HandleTable table_; }; static const int kNumShardBits = 4; static const int kNumShards = 1 << kNumShardBits; class ShardedLRUCache : public Cache { --- LRUCache shard_[kNumShards]; -- };
每个节点有3个指针,一个指针用于hash table的单链接指向。另外两个形成双向循环链表,以实现lru功能,方便控制cache的大小。插入时除了在hash table中查找合适的位置并插入,还需要插入到循环链表的开头。如果数据过多,就通过head来删除最老的节点。对于table,插入的元素个数大于桶的数目时,需要重新调整大小,新的size将是2的幂,但比元素个数多。
leveldb为了加快查找速度,并不只是使用了1个如上所示的LRUCache,而是16个(也就是说会按key再分一次桶),最终调用的是ShardedLRUCache接口。
4 内部编码/解析器(coding.cc/.h)
leveldb为了节省空间,可对int进行可变长度编码,也就是对小于2^7使用1个字节存储,小于2^14使用两个字节。。。。编码时将输入每隔7位分组,将字节的最高位始终置1,一直到最后1个字节的最高位置0表示结束。比如对int型的二进制数00000000,00000001,11011001,01000110,每隔7位分组则为111,0110010,1000110,再将每个字节的高位置1,则实际存储的为00000111,10110010,11000110。这样比起始终用4个字节存储,省了1个字节的存储空间。实际的编码代码如下:
char* EncodeVarint32(char* dst, uint32_t v) { // Operate on characters as unsigneds unsigned char* ptr = reinterpret_cast<unsigned char*>(dst); static const int B = 128; if (v < (1<<7)) { *(ptr++) = v; } else if (v < (1<<14)) { *(ptr++) = v | B; *(ptr++) = v>>7; } else if (v < (1<<21)) { *(ptr++) = v | B; *(ptr++) = (v>>7) | B; *(ptr++) = v>>14; } else if (v < (1<<28)) { *(ptr++) = v | B; *(ptr++) = (v>>7) | B; *(ptr++) = (v>>14) | B; *(ptr++) = v>>21; } else { *(ptr++) = v | B; *(ptr++) = (v>>7) | B; *(ptr++) = (v>>14) | B; *(ptr++) = (v>>21) | B; *(ptr++) = v>>28; } return reinterpret_cast<char*>(ptr);
5 默认比较器的实现—字节级比较器
由于leveldb内部是有序的,因此其比较器是至关重要的。头文件中的Comparator类提供了比较器的抽象接口。Name定义了该比较器的名字,会在open时进行检查。至于FindShortestSeparator、FindShortSuccessor表示什么意思,可参看默认的字节级比较器BytewiseComparator的实现,相当简单。
// Three-way comparison. Returns value: // < 0 iff "a" < "b", // == 0 iff "a" == "b", // > 0 iff "a" > "b" virtual int Compare(const Slice& a, const Slice& b) const = 0; // The name of the comparator. Used to check for comparator // mismatches (i.e., a DB created with one comparator is // accessed using a different comparator. // // The client of this package should switch to a new name whenever // the comparator implementation changes in a way that will cause // the relative ordering of any two keys to change. // // Names starting with "leveldb." are reserved and should not be used // by any clients of this package. virtual const char* Name() const = 0; // Advanced functions: these are used to reduce the space requirements // for internal data structures like index blocks. // If *start < limit, changes *start to a short string in [start,limit). // Simple comparator implementations may return with *start unchanged, // i.e., an implementation of this method that does nothing is correct. virtual void FindShortestSeparator( std::string* start, const Slice& limit) const = 0; // Changes *key to a short string >= *key. // Simple comparator implementations may return with *key unchanged, // i.e., an implementation of this method that does nothing is correct. virtual void FindShortSuccessor(std::string* key) const = 0;
未完待续
6 crc32
7 hash
8 histogram
9 logging.cc
10 random.h