I try to parse TPCH files with Boost Spirit QI.
My implementation inspired by the employee example of Spirit QI ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp ).
The data is in csv format and the tokens are delimited with a ‘|’ character.
It works but it is very slow (20 sec. for 1 GB).
Here is my qi grammer for the lineitem file:
struct lineitem {
int l_orderkey;
int l_partkey;
int l_suppkey;
int l_linenumber;
std::string l_quantity;
std::string l_extendedprice;
std::string l_discount;
std::string l_tax;
std::string l_returnflag;
std::string l_linestatus;
std::string l_shipdate;
std::string l_commitdate;
std::string l_recepitdate;
std::string l_shipinstruct;
std::string l_shipmode;
std::string l_comment;
};
BOOST_FUSION_ADAPT_STRUCT( lineitem,
(int, l_orderkey)
(int, l_partkey)
(int, l_suppkey)
(int, l_linenumber)
(std::string, l_quantity)
(std::string, l_extendedprice)
(std::string, l_discount)
(std::string, l_tax)
(std::string, l_returnflag)
(std::string, l_linestatus)
(std::string, l_shipdate)
(std::string, l_commitdate)
(std::string, l_recepitdate)
(std::string, l_shipinstruct)
(std::string, l_shipmode)
(std::string, l_comment))
vector<lineitem>* lineitems=new vector<lineitem>();
phrase_parse(state->dataPointer,
state->dataEndPointer,
(*(int_ >> "|" >>
int_ >> "|" >>
int_ >> "|" >>
int_ >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|'
) ), space, *lineitems
);
The problem seems to be the character parsing. It is much slower than other conversions.
Is there a better way to parse variable length tokens into strings?
I found a solution to my problem. As described in this post Boost Spirit QI grammar slow for parsing delimited strings
the performance bottleneck is the string handling of Spirit qi. All other data types seem to be quite fast.
I avoid this problem through doing the handling of the data on my own instead of using the Spirit qi handling.
My solution uses a helper class which offers functions for every field of the csv file. The functions store the values into a struct. Strings are stored in a char[]s. Hits the parser a newline character it calls a function which adds the struct to the result vector.
The Boost parser calls this functions instead of storing the values into a vector on its own.
Here is my code for the region.tbl file of the TCPH Benchmark:
It is not a pretty solution but it works and it is much faster. ( 2.2 secs for 1 GB TCPH data, multithreaded)