I try to parse TPCH files with Boost Spirit QI. My implementation inspired by

Question

Asked: June 14, 2026

I am trying to parse TPCH files with Boost Spirit QI.
My implementation is inspired by the employee example of Spirit QI ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp ).
The data is in CSV format and the tokens are delimited with a '|' character.

It works, but it is very slow (20 seconds for 1 GB).

Here is my Qi grammar for the lineitem file:

struct lineitem {
    int l_orderkey;
    int l_partkey;
    int l_suppkey;
    int l_linenumber;
    std::string l_quantity;
    std::string l_extendedprice;
    std::string l_discount;
    std::string l_tax;
    std::string l_returnflag;
    std::string l_linestatus;
    std::string l_shipdate;
    std::string l_commitdate;
    std::string l_recepitdate;
    std::string l_shipinstruct;
    std::string l_shipmode;
    std::string l_comment;
};

BOOST_FUSION_ADAPT_STRUCT( lineitem,
    (int, l_orderkey)
    (int, l_partkey)
    (int, l_suppkey)
    (int, l_linenumber)
    (std::string, l_quantity)
    (std::string, l_extendedprice)
    (std::string, l_discount)
    (std::string, l_tax)
    (std::string, l_returnflag)
    (std::string, l_linestatus)
    (std::string, l_shipdate)
    (std::string, l_commitdate)
    (std::string, l_recepitdate)
    (std::string, l_shipinstruct)
    (std::string, l_shipmode)
    (std::string, l_comment)) 

vector<lineitem>* lineitems=new vector<lineitem>();

phrase_parse(state->dataPointer,
    state->dataEndPointer,
    (*(int_ >> "|" >>
    int_ >> "|" >> 
    int_ >> "|" >>
    int_ >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' 
    ) ), space, *lineitems
);

The problem seems to be the character parsing. It is much slower than other conversions.
Is there a better way to parse variable-length tokens into strings?

1 Answer

Editorial Team · Answer 1 · 2026-06-14T17:20:33+00:00

I found a solution to my problem. As described in the post "Boost Spirit QI grammar slow for parsing delimited strings",
the performance bottleneck is the string handling of Spirit Qi. All other data types seem to be quite fast.
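To make the string-handling point concrete, here is a small sketch in plain C++ (no Spirit involved). It assumes that a rule like +(char_ - '|') bound to a std::string fills the string one character at a time, and contrasts that with locating the delimiter first and copying the whole range in one call. The function names fieldByPushBack and fieldByRange are made up for this illustration, and nothing here was timed.

```cpp
#include <string>

// Builds a field the way a container attribute is typically filled:
// one push_back per matched character.
std::string fieldByPushBack(const char* first, const char* last) {
    std::string s;
    for (const char* p = first; p != last && *p != '|'; ++p)
        s.push_back(*p);
    return s;
}

// Finds the delimiter first, then copies the whole range in one call.
std::string fieldByRange(const char* first, const char* last) {
    const char* p = first;
    while (p != last && *p != '|')
        ++p;
    return std::string(first, p);
}
```

Both functions return the same text up to the first '|'. The second one simply gives the string a chance to allocate once for the full field.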

I avoid this problem by handling the data myself instead of letting Spirit Qi store it.

My solution uses a helper class which offers a function for every field of the CSV file. The functions store the values into a struct. Strings are stored in char[] buffers. When the parser reaches the end of a record, it calls a function which adds the struct to the result vector.
The Boost parser calls these functions instead of storing the values into a vector on its own.

Here is my code for the region.tbl file of the TPCH benchmark:

struct region{
    int r_regionkey;
    char r_name[25];
    char r_comment[152];
};

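class regionStorage{
public:
// currentregion() zero-initializes the buffers so that unused bytes are '\0'.
regionStorage(vector<region>* regions) :regions(regions), currentregion(), pos(0) {}

// Called by the parser with the parsed region key.
void storer_regionkey(int const&i){
    currentregion.r_regionkey = i;
}

// Called once per character of the name field; characters beyond the buffer are dropped.
void storer_name(char const&i){
    if (pos < static_cast<int>(sizeof(currentregion.r_name)))
        currentregion.r_name[pos++] = i;
}

// Called once per character of the comment field; characters beyond the buffer are dropped.
void storer_comment(char const&i){
    if (pos < static_cast<int>(sizeof(currentregion.r_comment)))
        currentregion.r_comment[pos++] = i;
}

// Called at the '|' between two string fields.
void resetPos() {
    pos = 0;
}

// Called at the final '|' of a record: store it and clear the buffers,
// otherwise a shorter field would keep the tail of the previous row.
void endOfLine() {
    pos = 0;
    regions->push_back(currentregion);
    currentregion = region();
}

private:
vector<region>* regions;
region currentregion;
int pos;
};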


void parseRegion(){

    vector<region> regions;
    regionStorage regionstorageObject(&regions);
    phrase_parse(dataPointer, /*< start iterator >*/    
     state->dataEndPointer, /*< end iterator >*/
     (*(lexeme[
     +(int_[boost::bind(&regionStorage::storer_regionkey, &regionstorageObject, _1)] - '|') >> '|' >>
     +(char_[boost::bind(&regionStorage::storer_name, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::resetPos, &regionstorageObject)] >>
     +(char_[boost::bind(&regionStorage::storer_comment, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::endOfLine, &regionstorageObject)]
    ])), space);

   cout << regions.size() << endl;
}

It is not a pretty solution, but it works and it is much faster (2.2 seconds for 1 GB of TPCH data, multithreaded).
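For comparison, the same idea (writing each field straight into a fixed-size buffer instead of appending to a std::string) can also be tried without Spirit. The following is only a sketch under assumptions, not the code used for the timing above: RegionRow, copyField and parseRegions are hypothetical names, the buffers are one byte larger than in the struct above so that they always hold a terminating '\0', and every record is assumed to look like key|name|comment| followed by a line break.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Fixed-size record; one extra byte per buffer for the '\0' (an assumption of this sketch).
struct RegionRow {
    int  r_regionkey;
    char r_name[26];
    char r_comment[153];
};

// Hypothetical helper: copies [first, last) into dst, truncating if it does not fit.
template <std::size_t N>
void copyField(char (&dst)[N], const char* first, const char* last) {
    std::size_t len = static_cast<std::size_t>(last - first);
    if (len > N - 1)
        len = N - 1;
    std::memcpy(dst, first, len);
    dst[len] = '\0';
}

// Parses records of the form "key|name|comment|" separated by line breaks.
// Stops at the first record that is missing a delimiter.
std::vector<RegionRow> parseRegions(const char* p, const char* end) {
    std::vector<RegionRow> out;
    while (p < end) {
        // Skip line breaks between records.
        while (p < end && (*p == '\n' || *p == '\r'))
            ++p;
        if (p == end)
            break;

        RegionRow row = RegionRow();  // zero-initialized for every record

        // Field 1: non-negative integer key.
        int key = 0;
        while (p < end && *p >= '0' && *p <= '9') {
            key = key * 10 + (*p - '0');
            ++p;
        }
        if (p == end || *p != '|')
            break;
        row.r_regionkey = key;
        ++p;

        // Field 2: name, up to the next '|'.
        const char* bar = static_cast<const char*>(
            std::memchr(p, '|', static_cast<std::size_t>(end - p)));
        if (!bar)
            break;
        copyField(row.r_name, p, bar);
        p = bar + 1;

        // Field 3: comment, up to the closing '|'.
        bar = static_cast<const char*>(
            std::memchr(p, '|', static_cast<std::size_t>(end - p)));
        if (!bar)
            break;
        copyField(row.r_comment, p, bar);
        p = bar + 1;

        out.push_back(row);
    }
    return out;
}
```

Calling parseRegions(begin, end) on a memory-mapped or fully loaded region.tbl buffer returns one RegionRow per line. Whether this is faster than the Spirit version with semantic actions would have to be measured on the actual data.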
