I need to perform a regular expression search for a pattern x in a string y, and then find the token (word) index of the first character of the hit, where the tokens come from splitting y with a second regular expression (e.g. whitespace). The first regular expression may match a substring, so I cannot guarantee that the hit starts at the beginning of a token.
What would be the best algorithm to implement this? A simple approach would be the following:
- Search for x in y using the first regular expression and get the character offset z
- Split y into an array of elements using the second regular expression
- Loop through the array of elements adding the length of each item to a variable LENGTH and adding 1 to a counter COUNTER
- Stop the loop when LENGTH is greater than or equal to z
- The index of the token of the first character of the hit will be the value of COUNTER
(This assumes that the split function stores the splitting characters (e.g. whitespace) as array elements, which is very wasteful.)
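One way to avoid storing the separators at all is to count how many separator matches end at or before the hit offset z. The following is only a sketch in Python, not a definitive implementation. The function name `token_index_of_match` and the default separator `\s+` are my own assumptions, and it assumes the separator never matches the empty string. If the hit starts inside a separator, this version reports the preceding token.

```python
import re

def token_index_of_match(pattern, text, separator=r"\s+"):
    """Return the zero-based index of the token that contains the first
    character of the first match of `pattern`, or None if nothing matches.
    (Hypothetical helper; tokens are the pieces between separator matches.)"""
    hit = re.search(pattern, text)
    if hit is None:
        return None
    z = hit.start()  # character offset of the hit

    index = 0
    for sep in re.finditer(separator, text):
        # A separator ending after z cannot precede the hit, and neither
        # can any later separator, so the scan can stop here.
        if sep.end() > z:
            break
        index += 1  # one more complete token lies before the hit
    return index

print(token_index_of_match("ade", "The moon is made of cheese"))  # prints 3
```

The resulting index lines up with the list returned by `re.split(separator, text)`, including the empty first element that `re.split` produces when the text begins with a separator.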
A concrete (simple) example: Suppose I want to know the token (word) index for the search “ade” in the string “The moon is made of cheese”. The function should give me back the answer: 3 (for zero-indexed arrays).
==Edit==
The algorithm also needs to work when the regex search crosses token boundaries. For example, it should again return the index “3” when searching for “de of ch” in “The moon is made of cheese”.
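Since only the start offset of the hit matters, a hit that runs across token boundaries needs no special handling. Below is a rough Python sketch for the case where every hit is wanted: it precomputes the end offsets of all separator matches once and then binary-searches them for each hit. The name `token_indices_of_matches` and the default separator `\s+` are assumptions of mine, and the separator is assumed never to match the empty string.

```python
import re
from bisect import bisect_right

def token_indices_of_matches(pattern, text, separator=r"\s+"):
    """Return the zero-based token index for the start of every match of
    `pattern` in `text`. (Hypothetical helper, not from any library.)"""
    # finditer yields non-overlapping matches left to right, so the list
    # of end offsets is already sorted.
    sep_ends = [m.end() for m in re.finditer(separator, text)]
    # The number of separators ending at or before the hit offset equals
    # the index of the token that contains that offset.
    return [bisect_right(sep_ends, m.start())
            for m in re.finditer(pattern, text)]

print(token_indices_of_matches("de of ch", "The moon is made of cheese"))  # prints [3]
```

Each lookup is a binary search over the separator offsets, so this is worth it mainly when there are many hits in a long string.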
According to your updates:
output: