Does anyone know of a C# alternative to Tika that can extract text from HTML, PDF, etc.?
I’ve got a similar need… I’ve got a .NET project where I need to pull text out of various files (.XLS, .DOC, .PDF, etc.) for indexing with Lucene.Net.
This blog post seems to be exactly what I’m after: a .NET wrapper around the Tika .jar file!
I’m implementing it now, but if it doesn’t work then I’ll update my answer here…
Edit: Ok, it’s up, running, and working well (if a little slowly). There’s some pretty nasty dependency wrangling with the IKVM bits, but it’s the best alternative that I’ve found.
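For anyone wondering what the calling code looks like once the IKVM side is sorted out, here is a minimal sketch. It is not the code from the blog post. It assumes the Tika .jar has already been compiled to a .NET assembly with ikvmc and that the project references that assembly plus the IKVM.OpenJDK runtime assemblies. `TikaTextExtractor` and `ExtractText` are names made up for illustration; `org.apache.tika.Tika` and `parseToString` are from Tika's own Java API, which IKVM exposes under the same package names.

```csharp
using System;

// Sketch only: assumes a Tika .jar compiled with ikvmc and referenced by this
// project, together with the IKVM.OpenJDK.* runtime assemblies it depends on.
public static class TikaTextExtractor
{
    // Hypothetical helper: returns the plain text of any file type Tika can parse.
    public static string ExtractText(string path)
    {
        // The Tika facade detects the document type and picks a parser itself.
        var tika = new org.apache.tika.Tika();

        // parseToString takes a java.io.File; IKVM maps java.lang.String to System.String.
        return tika.parseToString(new java.io.File(path));
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: extract <file>");
            return;
        }

        // The returned text can then be handed to a Lucene.Net document field for indexing.
        Console.WriteLine(TikaTextExtractor.ExtractText(args[0]));
    }
}
```

The first call is likely to be the slow one, since the IKVM runtime and Tika's parsers all have to load, so reusing one extractor across many files is probably worth trying if speed is a concern.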