I am new to Hadoop and MapReduce. In my mapper I want to tokenize the data from a text file in the format shown below (first few lines):
9593C58F7C1C5CE4 970916072134 levis
9593C58F7C1C5CE4 970916072311 levis strause & co
9593C58F7C1C5CE4 970916072339 levis 501 jeans
45531846E8E7C127 970916065859
45531846E8E7C127 970916065935
45531846E8E7C127 970916070105 "brazillian soccer teams"
45531846E8E7C127 970916070248 "brazillian soccer"
45531846E8E7C127 970916071154 "population of maldives"
082A665972806A62 970916123431 pegasus
F6C8FFEAA26F1778 970916070130 "alicia silverstone" cutest crush batgirl babysitter clueless
945FF0D5996FD556 970916142859 mirc
With StringTokenizer I am unable to split this data; it gets confused when picking the fields out of this file. Is there any alternative for this problem other than String.split()?
@Hanry: Why don’t you use the same Java StringTokenizer? All you have to do is tokenize on spaces, get the total token count, then iterate through the tokens: use the first and second tokens as they are, and concatenate the subsequent tokens into a third string.
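A minimal sketch of that approach, assuming each record is one line with an ID, a timestamp and an optional free-text query. The class name QueryLogTokenizer and the method parseLine are hypothetical names made up for illustration. The default StringTokenizer delimiters are used here, so tabs are treated as separators as well as spaces, and lines with no query text (like the fifth and sixth sample lines) give an empty third field.

```java
import java.util.StringTokenizer;

// Hypothetical helper class; the names are illustrative, not from any library.
class QueryLogTokenizer {

    // Returns {id, timestamp, query}, or null if the line has fewer than two tokens.
    static String[] parseLine(String line) {
        // Default delimiters are whitespace characters (space, tab, newline, ...).
        StringTokenizer st = new StringTokenizer(line);
        if (st.countTokens() < 2) {
            return null; // malformed or empty line
        }

        String id = st.nextToken();        // first token
        String timestamp = st.nextToken(); // second token

        // Concatenate every remaining token into the third field,
        // putting a single space between tokens.
        StringBuilder query = new StringBuilder();
        while (st.hasMoreTokens()) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append(st.nextToken());
        }
        return new String[] { id, timestamp, query.toString() };
    }
}
```

Inside the mapper you would call parseLine(value.toString()) on each input line and skip the record when it returns null. Note that runs of several spaces inside the query are collapsed to one space by this re-joining.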