I need to detect words that are separated by spaces in a text. For example, my text is:
some parent +kid -control "human right" world
Now I need to detect some, parent and world (all words that don't have + - ( ) < > before or after them; all words inside quotes must be discarded). So I wrote this regex for preg_match_all():
(?:^|[\s]+)((?:(?![\+\(\)\<\>\s\-\"]).)+)(?:[\s]+|$)
But it only detects some and world. How can I fix it?
EDIT
I need it for Javascript too, but the regex doesn't seem to work there. How can I do it with Javascript?
EDIT
I found a solution, but it seems like a stupid way to do it. What are your ideas?
$str = 'some parent +kid -control "my human right" world';
$words=array();
$quot=false;
$discard=false;
$word='';
for($i=0;$i<=strlen($str);$i++){
    $chr=substr($str,$i,1);
    if($chr=='"'){
        if($quot){
            $quot=false;
        }else{
            $quot=true;
        }
        continue;
    }
    if($quot)continue;
    if($chr==' '||$i==strlen($str)){
        if(strlen($word)&&!$discard)$words[]=$word;
        $discard=false;
        $word='';
        continue;
    }elseif(in_array($chr,array('+','-','(',')','<','>'))){
        $discard=true;
        continue;
    }
    $word.=$chr;
}
print_r($words);//Array ( [0] => some [1] => parent [2] => world )
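Since the same behaviour is needed in Javascript, the PHP loop above can be ported more or less line by line. This is only a minimal sketch: the function name `scanPlainWords` is made up for illustration, and it assumes words are separated by single space characters, exactly as the PHP version does.

```javascript
// Character-by-character scan, mirroring the PHP loop above.
// `scanPlainWords` is a hypothetical name chosen for this sketch.
function scanPlainWords(str) {
  const special = new Set(['+', '-', '(', ')', '<', '>']);
  const words = [];
  let inQuotes = false; // true while between a pair of double quotes
  let discard = false;  // true once the current word contained a special character
  let word = '';

  // Go one step past the end so the last word is flushed as well.
  for (let i = 0; i <= str.length; i++) {
    const chr = str.charAt(i); // '' when i === str.length

    if (chr === '"') {
      inQuotes = !inQuotes;
      continue;
    }
    if (inQuotes) continue;

    if (chr === ' ' || i === str.length) {
      if (word.length > 0 && !discard) words.push(word);
      discard = false;
      word = '';
      continue;
    }
    if (special.has(chr)) {
      discard = true;
      continue;
    }
    word += chr;
  }
  return words;
}

console.log(scanPlainWords('some parent +kid -control "my human right" world'));
// [ 'some', 'parent', 'world' ]
```

Like the PHP version, this drops everything after an unclosed quote, which may or may not be what you want.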
EDIT
Final solution for PHP (this also works for multi-language queries; special thanks to rubber boots):
$query='some parent +kid -control "my human right" world';
$result=array();
if(preg_match_all('/(?:"[^"]+")|(?:^|[\s])(?P<q>(?:(?![\+\(\)\<\>\s\-\"]).)+)/',$query,$match)){
    $result=array_filter($match['q'],'strlen');
}
print_r($result);// some,parent,world
Final solution for Javascript (this also works for multi-language queries; special thanks to rubber boots):
var query='some parent +kid -control "my human right" world';
var result=Array();
var tmp;
var patt=RegExp('(?:"[^"]+")|(?:(?:^|\\s)((?:(?![\\+\\(\\)\\<\\>\\s\\-\\"]).)+))', 'g');
while(tmp = patt.exec(query)){
    if(typeof(tmp[1])!=='undefined') result.push(tmp[1]);
}
alert(result);// some,parent,world
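One caveat: the patterns above only check what comes before a word, so an input such as `a+b` still yields `a`. If words that are directly followed by one of the special characters should be dropped too, one possible variant adds a lookahead after the captured word. This is a sketch under that assumption; `matchStrictWords` is a hypothetical name, and the character class `[^\s+()<>"-]` is just a shorter spelling of the negative-lookahead group used above.

```javascript
// Hypothetical stricter variant: a word only counts if it is preceded by
// start-of-string or whitespace AND followed by whitespace or end-of-string.
function matchStrictWords(text) {
  // Alternative 1 swallows a quoted phrase; alternative 2 captures a plain word.
  const pattern = /"[^"]+"|(?:^|\s)([^\s+()<>"-]+)(?=\s|$)/g;
  const found = [];
  let m;
  while ((m = pattern.exec(text)) !== null) {
    // Group 1 is undefined when the quoted alternative matched.
    if (m[1] !== undefined) found.push(m[1]);
  }
  return found;
}

console.log(matchStrictWords('some parent +kid -control "my human right" world a+b'));
// [ 'some', 'parent', 'world' ]
```

The lookahead does not consume the following space, so the next word can still use that space as its leading separator.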
If the following string is given:
it’s possible to extract words according to your specification with a rather simple expression too:
This results in:
The technique used:
The expr (?:" [^"]+ ")? consumes the quotes and their contents.
Addendum: Javascript
For Javascript, you need to use a slightly more complicated approach: older Javascript engines (before ES2018) have no lookbehind assertions, so we fake them with (?:^|\\s) in front of an allowed word. This will work:
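A minimal sketch of this approach, assuming the query string from the question. The helper name `matchPlainWords` is hypothetical, and the regex is written as a literal, so the whitespace escape is a single `\s` here:

```javascript
// Hypothetical helper: collects words that are not inside quotes and are
// not introduced by one of + - ( ) < >
function matchPlainWords(text) {
  // Alternative 1 consumes a whole quoted phrase without capturing anything.
  // Alternative 2 requires start-of-string or whitespace (the faked lookbehind)
  // and captures the following run of allowed characters into group 1.
  const pattern = /"[^"]+"|(?:^|\s)([^\s+()<>"-]+)/g;
  const found = [];
  let m;
  while ((m = pattern.exec(text)) !== null) {
    // m[1] is undefined whenever the quoted alternative matched.
    if (m[1] !== undefined) found.push(m[1]);
  }
  return found;
}

const a = matchPlainWords('some parent +kid -control "my human right" world');
console.log(a.join(', ')); // some, parent, world
```

Because the quoted alternative comes first, the words inside the quotes are swallowed before the second alternative gets a chance to see them.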
We use the same technique here: generate captured submatches in $1 for the words we need.
The contents of the array a (document.getElementById("myhtml").innerHTML = a;) will then contain: