Interview Question
I was asked this question in an interview, and the answer doesn’t have to be specific to a programming language, platform, or tool.
The question was phrased as follows:
How would you get the instance count of a given word in a PDF? The answer doesn’t have to be programming-, platform-, or tool-specific. Just let me know how you would do it in a memory- and speed-efficient way.
I am posting this question for the following reasons:
- To better understand the context – I still fail to understand the context of this question, what might the interviewer be looking for by asking this question?
- To get diverse opinions – I tend to answer such questions based on my skills on a programming language (C#), but there might be other valid options to get this done.
Thanks for your interest.
If I had to write a program to do it, I’d find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words.
If this were a one-off task, or something that needed to be automated without being production quality, I’d just feed the file into the pdftotext program and then parse the output with Python: split it into words, put them in a dictionary, and count the number of occurrences.
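The one-off approach could look roughly like the sketch below. It assumes pdftotext is installed and on the PATH, and that passing "-" as the output file makes it write the extracted text to stdout. The function names and the simple \w+ tokenizer are my own illustrative choices, and what counts as a "word" is exactly the part you would want to discuss with the interviewer. Reading the output line by line keeps memory use proportional to the number of distinct words rather than to the size of the document.

```python
import re
import subprocess
from collections import Counter

WORD_RE = re.compile(r"\w+")  # assumption: a "word" is a run of letters/digits/underscores


def count_words(lines):
    """Count case-insensitive word occurrences in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def count_words_in_pdf(pdf_path):
    """Stream pdftotext's output and count words without loading it all at once."""
    proc = subprocess.Popen(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    with proc.stdout:
        counts = count_words(proc.stdout)
    if proc.wait() != 0:
        raise RuntimeError("pdftotext failed on " + pdf_path)
    return counts
```

Usage would be something like count_words_in_pdf("report.pdf")["invoice"], where the file name and the word are placeholders. Note that this sketch does not rejoin words hyphenated across line breaks.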
If I were asking this interview question, I’d be looking for a couple of things:
- a one-off script vs. production code
- implementing a PDF renderer yourself vs. trying to find a library instead.
Now I wouldn’t expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what a PDF is and what a “word” is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Oftentimes a word is split into a couple of completely separate strings that are absolutely positioned in the document so that they render as a single word. This is why searching for words in a PDF document sometimes gives strange-looking results. So to implement word searching in a document, you’d have to glue these strings back together (pdftotext takes care of that for you).
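To make the "gluing" idea concrete, here is a deliberately simplified sketch. It is not how pdftotext actually works internally. It assumes you already have the text fragments with their positions; the Fragment type, its fields, and both tolerance values are hypothetical. Fragments on roughly the same baseline are sorted left to right, and two neighbours are joined without a space when the horizontal gap between them is tiny.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical positioned string, as a PDF text extractor might report it."""
    text: str
    x: float      # left edge
    y: float      # baseline (PDF coordinates: larger y is higher on the page)
    width: float  # rendered width of the string


def glue_fragments(fragments, gap_tolerance=1.0, line_tolerance=2.0):
    """Rebuild lines of text from positioned fragments (illustrative only)."""
    # Group fragments into lines by baseline, top of the page first.
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    out = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left to right within a line
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A tiny gap means the two strings belong to the same word.
            parts.append(cur.text if gap <= gap_tolerance else " " + cur.text)
        out.append("".join(parts))
    return "\n".join(out)
```

Real documents are messier (rotated text, multiple columns, kerning adjustments, ligatures), which is a good argument for using an existing library rather than writing this yourself.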
It’s not a bad question at all.