コンパイル時定数ダブル配列を作って遊んだ

内容はタイトルの通りで、C++17でコンパイル時定数として扱えるダブル配列を実装して遊びました。ご承知の通り、コンパイル時計算は最高ですしダブル配列も最高です。よってこの試みは最高です。¹

実装は以下で公開しています。

本記事では、この実装について簡単に紹介します。といっても現代の constexpr は非常に柔軟で優秀なので、特にテクニカルなことはしなくて済みました。良き。

使用例
コンパイル時計算の制限緩和
まとめ

使用例

constexpr_doublearray はヘッダファイルのみで構成されるので、ディレクトリ include にパスを通すだけで使っていただけます。

$ ls -1 include
constexpr_doublearray.hpp
fixed_capacity_vector

constexpr_doublearray.hpp はコンパイル時定数ダブル配列の実装といくつかのユーティリティ関数を含みます。
fixed_capacity_vector はgnzlbg氏による定数配列の実装です。いわゆる Boost の static_vector の constexpr 対応版です。

sample.cpp に沿って簡単に使用例を解説していきます。

前処理

text のような単語がNULL文字で区切られた文字列定数を用意していただければ、関数 text_to_words によってビルダーに渡す用の単語配列 words が構築できます。コンパイル時には動的な配列のアロケーションができないので、前もって関数 get_num_words で単語数を数え上げます。²

#include <constexpr_doublearray.hpp>

using namespace std::string_view_literals;

// Input words concatenated by NULL character (must be sorted)
constexpr auto text = "ICDE\0"
                      "ICDM\0"
                      "ICDMW\0"
                      "ICML\0"
                      "SIGIR\0"
                      "SIGMOD\0"sv;

// Convert the text to the array of words
constexpr auto num_words = constexpr_doublearray::utils::get_num_words(text);
constexpr auto words = constexpr_doublearray::utils::text_to_words<num_words>(text);

C++17 では std::string_view が constexpr 対応してますので、sv を文字列リテラルに添えるだけで簡単に文字列定数が作れます。素晴らしいですね。未検証ですが、C++14以前では sprout::string で文字列定数が作れるはずです。

ちなみに txt2hpp.py を使えば、お手持ちのテキストデータから文字列定数 text を定義した dataset.hpp を作ることができます。

$ head -5 univs.txt
aichi_bunkyo_university
aichi_gakuin_university
aichi_institute_of_technology
aichi_kiwami_college_of_nursing
aichi_konan_college
$ python txt2hpp.py univs.txt

構築

先ほど作った words を関数 make に渡すことでコンパイル時定数ダブル配列を作成できます。

// Double-array dictionary
constexpr auto capacity = constexpr_doublearray::get_capacity(words);
constexpr auto dict = constexpr_doublearray::make<capacity>(words);

動的アロケーションがコンパイル時にはできないので、ここでも get_capacity で前もってダブル配列の容量を見積もります。

ダブル配列に詳しい方ならご存知でしょうが、ダブル配列はデータセットや構築手順によって空要素を含みます。すなわち、厳密な配列のサイズは実際にダブル配列を作ってみるまでわかりません。これはコンパイル時計算と相性が悪く、get_capacity では words から計算したトライ木のノード数に RESERVE_FACTOR=1.1 を掛けた値をダブル配列の容量として設定しています。なので、データセットによっては capacity が不十分で構築に失敗するかもしれません。その場合は、RESERVE_FACTOR の値を増やし再挑戦して頂く必要があります。

現段階ではこんな残念な実装になっちゃってますが、C++20ではコンパイル時の動的アロケーションに向けて動いているようなので、この辺りの悩みも杞憂に終わるかもしれません（というか終わって）。

onihusube.hatenablog.com

一致検索

コンパイル時定数ダブル配列を用いれば、検索もコンパイル時に楽勝です。

// Exact search
constexpr auto icde_sr = constexpr_doublearray::search("ICDE"sv, dict);
constexpr auto sigmod_sr = constexpr_doublearray::search("SIGMOD"sv, dict);
constexpr auto sigkdd_sr = constexpr_doublearray::search("SIGKDD"sv, dict);

int main() {
    std::cout << "search(ICDE) = " << icde_sr.id << std::endl;
    std::cout << "search(SIGMOD) = " << sigmod_sr.id << std::endl;
    std::cout << "search(SIGKDD) = " << sigkdd_sr.id << std::endl;
}

constexpr_doublearray は登録単語に整数値IDを辞書式順に割り当てます。検索結果は search_result 型で返ってきて、フィールド id にそのIDが格納されます。

上のコードでは出力は以下のようになります。

search(ICDE) = 0
search(SIGMOD) = 5
search(SIGKDD) = 18446744073709551615

逆引き

先程の検索結果から元の文字列もコンパイル時に復元できます。

// Decoding
constexpr auto icde_dec = constexpr_doublearray::decode<icde_sr.depth>(icde_sr.npos, dict);
constexpr auto sigmod_dec = constexpr_doublearray::decode<sigmod_sr.depth>(sigmod_sr.npos, dict);

int main() {
    std::cout << "decode(icde_sr) = " << std::string(std::begin(icde_dec), std::end(icde_dec)) << std::endl;
    std::cout << "decode(sigmod_sr) = " << std::string(std::begin(sigmod_dec), std::end(sigmod_dec)) << std::endl;
}

search_result 型はフィールド npos と depth を持ちます。関数 decode はノード位置 npos からの逆引きで文字列を復元しています。テンプレート引数には復元文字列を格納するためのバッファサイズを渡す必要があり、ノードの深さ depth を渡せば十分なサイズが確保できます。

出力は以下のようになります。

decode(icde_sr) = ICDE
decode(sigmod_sr) = SIGMOD

共通接頭辞検索

お得意の共通接頭辞検索も、もちろんコンパイル時です。

// Common prefix search
constexpr auto icdmw_size = std::size("ICDMW"sv);
constexpr auto icdmw_cpsr = constexpr_doublearray::common_prefix_search<icdmw_size>("ICDMW"sv, dict);

int main() {
    std::cout << "common_prefix_search(ICDMW) = ";
    for (auto r : icdmw_cpsr) {
        const auto dec = constexpr_doublearray::decode<icdmw_size>(r.npos, dict);
        std::cout << "(id=" << r.id << ",str=" << std::string(std::begin(dec), std::end(dec)) << "), ";
    }
    std::cout << std::endl;
}

関数 common_prefix_search は search_result の配列を結果として返します。テンプレート引数には結果を格納するバッファのサイズを渡す必要があり、クエリ文字列長を渡せば十分です。

出力は以下のようになります。

common_prefix_search(ICDMW) = (id=1,str=ICDM), (id=2,str=ICDMW),

補完検索

補完検索もコンパイル時ですね。ありがとうございます。

// Predictive search
constexpr auto sig_psr = constexpr_doublearray::predictive_search<num_words>("SIG"sv, dict);
// Buffer size for decoding
constexpr auto decode_size = constexpr_doublearray::utils::get_max_length(words);

int main() {
    std::cout << "predictive_search(SIG) = ";
    for (auto r : sig_psr) {
        const auto dec = constexpr_doublearray::decode<decode_size>(r.npos, dict);
        std::cout << "(id=" << r.id << ",str=" << std::string(std::begin(dec), std::end(dec)) << "), ";
    }
    std::cout << std::endl;
}

関数 predictive_search でも search_result の配列を結果として返します。ここでもテンプレート引数に結果を格納するバッファのサイズを渡す必要があり、単語数 num_words を渡せば十分ですが、結果に対して大きすぎる場合もあります。バッファサイズが足りない場合は、そのサイズ分だけの結果を返します。

出力は以下のようになります。

predictive_search(SIG) = (id=4,str=SIGIR), (id=5,str=SIGMOD),

コンパイル時計算の制限緩和

コンパイル時間が長くなりすぎる等を抑制するため、多くのコンパイラではコンパイル時の計算回数を制限しています。例えば、Clangでは以下のコンパイルオプションで指定されます。

-fconstexpr-depth: 再帰的な constexpr 関数の呼び出し回数上限（デフォルトは512）
-fconstexpr-steps: 1回の定数式内での評価回数の上限（デフォルトは1048576）

clang.llvm.org

大きいデータセットからダブル配列を構築する場合、デフォルト値では制限に引っかかるかもしれません。例えば、ダブル配列の構築と補完検索ではトライ木を再帰的に辿っており、その最大深度は単語の最大長です。すなわち、-fconstexpr-depth には登録単語の最大長以上の値を設定する必要があります。構築で配列を走査する部分もあるので、-fconstexpr-steps にも十分に大きな値を設定する必要があります。

Clangでこれらに最大値を設定する場合は -fconstexpr-depth=-1 -fconstexpr-steps=-1 とすれば良いようです。ハック的な方法なのでClangのアップデートによっては失敗するかもしれませんが、とりあえずAppleClang 12.0.0ではうまくいっています。

qiita.com

ちなみにGNU コンパイラでは -fconstexpr-steps の代わりに -fconstexpr-loop-limit を指定します。Clangと違い負値を指定すると怒られるので注意しましょう。

gcc.gnu.org

まとめ

というわけでコンパイル時定数ダブル配列を作って遊びました。

今回はC++17でどんなことができるのかを勉強することが目的だったので、Sprout のようなライブラリには敢えて頼りませんでした。C++17では constexpr 対応していないアルゴリズム関数が多いですが、C++20ではそのほとんどが対応しているようです。コンパイル時の動的アロケーションも目指しているようなので、順次移行していきたいです。

今のところ、手元にあった10万単語のデータセットでは（多少時間はかかりますが）問題なく構築できています。今後、コンパイル時間やどれくらいスケールするかなども調べてみたい所存。

役に立つかどうかなんて些細なことです。↩
num_words を外で計算せず、関数 text_to_words 内で行う方法をご存知な方いましたら、コメント等でご教授ください。↩