I need a Regex to detect questions within a text.
Example input:
please, tell me how to do this… or how to make it right! and so on….
I need output:
- how to do this
- how to make it right
Right now I use this, but it does not work:
(?<q>(how to|how match|how many).*)(\s|\.|;|!|\?|( \-)|(\- )|‾|:|…|_|\||@|~|…|–|—|¯|»|•|●|{|}|\(|\)|\\|\]|\[|>|<|→|'|""|`|$)
I only need "how to" questions.
The task you are trying to accomplish falls under a different category than what regular expressions are good for.
To extract arbitrary questions from text, you need a lot more than a few good regular expressions. Start by looking at a good natural language processing toolkit, and perhaps begin with part-of-speech tagging. From there you will need syntax and sentence parsing, and then you can try to answer the question “Is this sentence a question?” for each sentence your NLP pipeline has identified.
Armed with this knowledge, you should at least understand that the task is rather difficult. It is not impossible, but it will require a lot of fine-tuning to get good performance (usually measured with accuracy and precision metrics). You will most likely not get anywhere near 100% on either, but you should be able to get decent results with a good PoS tagger and a good sentence parser.
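To make the measurement idea concrete, here is a minimal sketch of scoring a sentence-level "is this a question?" rule on accuracy and precision. The helper names (`accuracy_and_precision`, `naive_is_question`), the baseline rule, and the four labeled sentences are all made up for illustration; a real evaluation would need a much larger hand-labeled set.

```python
def accuracy_and_precision(predictions, labels):
    """Compare boolean predictions against boolean gold labels."""
    pairs = list(zip(predictions, labels))
    correct = sum(1 for p, g in pairs if p == g)
    true_pos = sum(1 for p, g in pairs if p and g)
    predicted_pos = sum(1 for p, _ in pairs if p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


def naive_is_question(sentence):
    # Hypothetical baseline rule: a question is anything ending in "?".
    return sentence.strip().endswith("?")


# Hypothetical hand-labeled sample: (sentence, is it a question?)
sample = [
    ("How do I do this?", True),
    ("Tell me how to make it right.", True),  # a request the rule misses
    ("This works fine.", False),
    ("I wonder why it fails.", False),
]

predictions = [naive_is_question(text) for text, _ in sample]
labels = [gold for _, gold in sample]
print(accuracy_and_precision(predictions, labels))  # (0.75, 1.0)
```

On this toy sample the rule never flags a non-question, so precision is perfect, yet it still misses the "tell me how to" request, which is exactly the kind of case simple rules struggle with.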
EDIT
In light of your recent edit to the question, you may be able to get some basic coverage with regexes and hand-written rules, but you will still fail to differentiate many of the more complicated cases. The natural language processing toolkit route is still preferred for a more generic solution.
Don’t spend too much time trying to come up with a silver-bullet regular expression to match natural language. Natural language is not regular, so it’s not going to work! It’s OK to use regular expressions to identify some keywords, but beyond that you’re better off with simple hand-written rules and tokenizing in lieu of a good natural language pipeline.
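As a rough illustration of the "keywords plus simple rules" approach for the "how to" case only, here is a sketch in Python. It assumes that clauses end at common punctuation marks and that a new "how to" (optionally preceded by "or"/"and") starts a new question. The function name `extract_how_to` and both patterns are my own choices, not a definitive solution, and it will miss or mangle many real-world sentences.

```python
import re

# Assumption: these punctuation marks delimit clauses; extend the set as needed.
CLAUSE_BREAK = re.compile(r"[.,;:!?…]+")

# Keyword rule: start at "how to" and stop lazily before the next "how to"
# (optionally preceded by "or"/"and"), or at the end of the clause.
HOW_TO = re.compile(
    r"\bhow to\b.*?(?=(?:\s+(?:or|and))?\s+how to\b|$)",
    re.IGNORECASE,
)


def extract_how_to(text):
    """Return the 'how to ...' fragments found in text, in order."""
    found = []
    for clause in CLAUSE_BREAK.split(text):      # step 1: tokenize on punctuation
        for match in HOW_TO.finditer(clause):    # step 2: keyword match per clause
            found.append(match.group(0).strip())
    return found


print(extract_how_to(
    "please, tell me how to do this… or how to make it right! and so on…."
))
# ['how to do this', 'how to make it right']
```

Splitting on punctuation first keeps the regex itself small: it only has to recognize the keyword, not every possible terminator.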
UPDATE
If you’re really serious about this task, have a look at sharpnlp.codeplex.com as a starting point. There are other NLP toolkits out there, with NLTK springing to mind as a popular one if you’re not required to use C#. As a second step, get yourself an introductory book on NLP. The subject is vast and really cool. A great book I’ve learned a lot from is Speech and Language Processing by Jurafsky and Martin.
And as a final thought, here’s what I would do at a minimum:
. , ; ? !)
Good luck!