I am looking for the best approach to search for and replace a "group of strings" within another String. The group of strings is constant (around 150 strings). The text to search is dynamic (around 10,000 characters, nearly 2,000 words).
Group 1 : {"foo", "duck", "man", ….., "xyz"} [fixed set – O(150)]
Group 2 : “My name is foo. I have a duck” [dynamic text – O(2000)]
Input Text : My name is foo. I have a duck.
Expected Output Text : My name is *. I have a *.
The best approach I could think of is…
1) Convert group 1 into a HashSet (call it restricted).
2) Convert the dynamic text into a String[] (call it words).
3) Loop through the String[] and check whether each string exists in the HashSet.
for (int i = 0; i < words.length; i++) {
    if (restricted.contains(words[i])) {
        // Replace the string in the text
    }
}
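The three steps above can be sketched as a runnable example. This is only a minimal sketch under some assumptions: the class name `HashSetWordFilter` and the method `mask` are made up for illustration, a "word" is taken to be a run of letters or digits, and matching is case-sensitive. Instead of splitting into a String[], it walks the text once so that punctuation and spacing survive unchanged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class HashSetWordFilter {
    // Fixed group of strings, built once (hypothetical sample data).
    private static final Set<String> RESTRICTED =
            new HashSet<>(Arrays.asList("foo", "duck", "man", "xyz"));

    // Replaces every restricted word with a single '*'.
    static String mask(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                // Collect one whole word.
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                String word = text.substring(start, i);
                out.append(RESTRICTED.contains(word) ? "*" : word);
            } else {
                // Copy punctuation and whitespace as-is.
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: My name is *. I have a *.
        System.out.println(mask("My name is foo. I have a duck."));
    }
}
```

Each lookup in the HashSet is O(1) on average, so the whole pass is linear in the length of the text.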
Any better alternatives?
Please share your thoughts…
UPDATED
This is the final code, with its output, for replacing a group of strings in another String (using regex).
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringReplacementTest
{
    private static final String[] restricted_words_list={"foo","duck","man","xyz"};
    private static final String[] not_restricted_words_list={"zoo","book","cool"};
    private static final Pattern restrictedReplacer;
    private static final Pattern nonRestrictedReplacer;
    private static Set<String> restrictedWords = null;
    private static List<String> nonRestrictedWords = null;

    static {//done once only
        StringBuilder strb= new StringBuilder();
        for(String str:restricted_words_list){
            strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
            //using word break to avoid ***umptions;
        }
        strb.setLength(strb.length()-1);//drop the trailing '|'
        restrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);
        strb = new StringBuilder();
        for(String str:not_restricted_words_list){
            strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
        }
        strb.setLength(strb.length()-1);//drop the trailing '|'
        nonRestrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);
    }

    /**
     * @param args
     */
    public static void main(String[] args)
    {
        String inputText = "My name is foo. I have a duck.. not ducks. I am FOO and the duckz at the zoo. i read book and COOL";
        System.out.println("inputText : " + inputText);
        String modifiedText = restrictedWordCheck(inputText);
        modifiedText = nonRestrictWordCheck(modifiedText);
        System.out.println("Modified Text : " + modifiedText);
        System.out.println("List of restricted Words" + restrictedWords);
        System.out.println("List of non-restricted words" + nonRestrictedWords);
    }

    public static String restrictedWordCheck(String input){
        Matcher m = restrictedReplacer.matcher(input);
        StringBuffer strb = new StringBuffer(input.length());//ensuring capacity
        while(m.find()){
            if(restrictedWords==null)restrictedWords = new HashSet<String>();
            restrictedWords.add(m.group()); //m.group() returns what was matched
            m.appendReplacement(strb,""); //this writes out what came in between matching words
            for(int i=m.start();i<m.end();i++)
                strb.append("*");
        }
        m.appendTail(strb);
        return strb.toString();
    }

    public static String nonRestrictWordCheck(String input){
        Matcher m = nonRestrictedReplacer.matcher(input);
        while(m.find()){
            if(nonRestrictedWords==null)nonRestrictedWords = new ArrayList<String>();
            nonRestrictedWords.add(m.group());
        }
        return m.replaceAll("<b>$0</b>");//replaceAll resets the matcher first
    }
}
OUTPUT
inputText : My name is foo. I have a duck.. not ducks. I am FOO and the duckz at the zoo. i read book and COOL
Modified Text : My name is ***. I have a ****.. not ducks. I am *** and the duckz at the <b>zoo</b>. i read <b>book</b> and <b>COOL</b>
List of restricted Words[duck, foo, FOO]
List of non-restricted words[zoo, book, COOL]
Any advice to further optimize the implementation is welcome 🙂
Thanks
Use a precompiled Pattern.
You can get more complicated in what you want to replace it with, for example surrounding the word with <b> tags: replacer.matcher(in).replaceAll("<b>$0</b>"); ($0 refers to the whole match).
But if you want to, say, match the length of the matched string, you'll have to loop over the matches explicitly with find(), appendReplacement() and appendTail().
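As a rough illustration of that explicit loop, here is a minimal sketch. The class name `MaskByLength`, the helper `buildPattern` and the sample word list are assumptions made up for this example; only `Pattern`, `Matcher.find`, `appendReplacement` and `appendTail` come from the standard `java.util.regex` API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaskByLength {
    // Hypothetical sample words; the real list would hold ~150 entries.
    private static final String[] WORDS = {"foo", "duck", "man", "xyz"};

    // Compiled once and reused for every input text.
    private static final Pattern REPLACER = buildPattern(WORDS);

    // Builds \b(?:w1|w2|...)\b with each word quoted literally.
    static Pattern buildPattern(String[] words) {
        StringBuilder sb = new StringBuilder("\\b(?:");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(sb.append(")\\b").toString(), Pattern.CASE_INSENSITIVE);
    }

    // Replaces each match with as many '*' as the match is long.
    static String mask(String in) {
        Matcher m = REPLACER.matcher(in);
        StringBuffer out = new StringBuffer(in.length());
        while (m.find()) {
            m.appendReplacement(out, ""); // copies the text between matches
            for (int i = m.start(); i < m.end(); i++) {
                out.append('*');
            }
        }
        m.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: My name is ***. I have a ****.. not ducks.
        System.out.println(mask("My name is foo. I have a duck.. not ducks."));
    }
}
```

`StringBuffer` is used here because the `appendReplacement(StringBuilder, String)` overload only exists from Java 9 onwards.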
But if you want to be assured of optimal runtime, you can build a trie from the fixed words and run the long string through it.
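A trie-based version could look roughly like the sketch below. It is not a tuned implementation: the class name `TrieWordFilter`, the nested `Node` type and the method `mask` are invented for illustration, words are assumed to be runs of letters or digits, and matching is case-insensitive via `Character.toLowerCase`.

```java
import java.util.HashMap;
import java.util.Map;

class TrieWordFilter {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a restricted word ends at this node
    }

    private final Node root = new Node();

    // Builds the trie once from the fixed group of words.
    TrieWordFilter(String... words) {
        for (String w : words) {
            Node n = root;
            for (char c : w.toLowerCase().toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
    }

    // Single pass over the text; each character is looked at once.
    String mask(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                out.append(c); // punctuation and whitespace pass through
                i++;
                continue;
            }
            // Walk the trie along the whole word; n becomes null once we fall off.
            Node n = root;
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                if (n != null) {
                    n = n.next.get(Character.toLowerCase(text.charAt(i)));
                }
                i++;
            }
            if (n != null && n.terminal) {
                for (int k = start; k < i; k++) out.append('*');
            } else {
                out.append(text, start, i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieWordFilter filter = new TrieWordFilter("foo", "duck", "man", "xyz");
        // prints: My name is ***. I have a ****.. not ducks. I am ***
        System.out.println(filter.mask("My name is foo. I have a duck.. not ducks. I am FOO"));
    }
}
```

For whole-word matching like this, a HashSet lookup per word performs similarly; the trie is more likely to pay off if the restricted strings may also appear inside longer words or span several words.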