I have a text document and a query (the query could be more than one word). I want to find the position of all occurrences of the query in the document.
I thought of using documentText.indexOf(query) or a regular expression, but I could not make either work.
I ended up with the following method.
First, I created a data type called QueryOccurrence:
import java.io.Serializable;

// Holds the start and end position of one occurrence of the query.
public class QueryOccurrence implements Serializable {
    private int start;
    private int end;

    public QueryOccurrence() {}

    // Note: nameText is accepted but currently not stored.
    public QueryOccurrence(int nameStart, int nameEnd, String nameText) {
        start = nameStart;
        end = nameEnd;
    }

    public int getStart() {
        return start;
    }

    public int getEnd() {
        return end;
    }

    public void SetStart(int i) {
        start = i;
    }

    public void SetEnd(int i) {
        end = i;
    }
}
Then I used this data type in the following method:
// Requires: import java.util.ArrayList; import java.util.List;
public static List<QueryOccurrence> FindQueryPositions(String documentText, String query) {
    // Normalize does the following: lower case, trim, and remove punctuation
    String normalizedQuery = Normalize.Normalize(query);
    String normalizedDocument = Normalize.Normalize(documentText);
    String[] documentWords = normalizedDocument.split(" ");
    String[] queryArray = normalizedQuery.split(" ");
    List<QueryOccurrence> foundQueries = new ArrayList<>();
    QueryOccurrence foundQuery = new QueryOccurrence();
    // Positions are word indexes in the normalized document, not character offsets.
    int index = 0;
    for (String word : documentWords) {
        // Remember where the first query word was last seen
        if (word.equals(queryArray[0])) {
            foundQuery.SetStart(index);
        }
        // Note: only the first and last query words are compared;
        // the words in between are not checked.
        if (word.equals(queryArray[queryArray.length - 1])) {
            foundQuery.SetEnd(index);
            if ((foundQuery.getEnd() - foundQuery.getStart()) + 1 == queryArray.length) {
                // add the found query to the list
                foundQueries.add(foundQuery);
                // reset the foundQuery variable to use it again
                foundQuery = new QueryOccurrence();
            }
        }
        index++;
    }
    return foundQueries;
}
This method returns a list of all occurrences of the query in the document, each with its position.
Could you suggest an easier and faster way to accomplish this task?
Thanks
Your first approach was a good idea, but String.indexOf does not support regular expressions.
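For a plain (non-regex) query, the indexOf idea can still be made to work by calling the two-argument overload, String.indexOf(String, int), in a loop. The following is only a minimal sketch under that assumption; the class name IndexOfSearch and the method name findAll are made up for illustration, and it searches the raw text without the normalization step from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the question's code.
class IndexOfSearch {

    // Returns the character offset of every occurrence of query in text.
    static List<Integer> findAll(String text, String query) {
        List<Integer> positions = new ArrayList<>();
        if (query.isEmpty()) {
            return positions; // an empty query would match at every offset
        }
        int from = 0;
        while (true) {
            int hit = text.indexOf(query, from);
            if (hit < 0) {
                break; // no further occurrence
            }
            positions.add(hit);
            from = hit + 1; // advance by one so overlapping occurrences are found too
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(findAll("the cat and the hat", "the")); // prints [0, 12]
    }
}
```

The end of each occurrence is simply the start plus query.length(), so a separate end field is not strictly needed for literal queries.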
Another, easier way that uses a similar approach, but in two steps, is as follows:
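A minimal sketch of what the two steps could look like, assuming they are (1) compiling the query into a java.util.regex.Pattern and (2) walking the matches with Matcher.find(). The class name RegexSearch and the method name findAll are hypothetical, and the use of Pattern.quote and CASE_INSENSITIVE is an assumption meant to mirror the lower-casing done in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the question's code.
class RegexSearch {

    // Returns the character offset at which each match of query starts.
    static List<Integer> findAll(String documentText, String query) {
        // Step 1: compile the query. Pattern.quote makes it match literally,
        // so characters such as '.' or '(' in the query are not treated as regex syntax.
        Pattern pattern = Pattern.compile(Pattern.quote(query), Pattern.CASE_INSENSITIVE);

        // Step 2: iterate over the (non-overlapping) matches and record each start.
        Matcher matcher = pattern.matcher(documentText);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }
}
```

If the end of each match is needed as well, matcher.end() returns the offset just past the last matched character.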
Here, positions will hold the start positions of all the matches.