I am writing a classifier that categorizes whether a special deal is for a restaurant, a hotel, etc. It is part of a web crawler that analyzes external sites.
To start, I wrote a meal?() method that accepts a piece of text and returns true if it thinks the text is about a meal deal. It can’t be 100% accurate, since it uses only simple keyword matching.
def meal?(text)
  !text.match(/restaurant|meal|wine|.../i).nil?
end
Now I am writing a test for it, and I have two questions. First, it seems a bit redundant to list all of these keywords again in the unit test. What do you think?
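To make the first question concrete, here is a minimal sketch of one way the redundancy could be avoided, assuming the keywords were moved into a shared constant. MEAL_KEYWORDS, MEAL_PATTERN and keywords_not_detected are hypothetical names that are not in my current code, and only the three keywords shown above are listed.

```ruby
# Hypothetical refactoring: the keywords live in one constant that both
# the classifier and its test can read. Only the three keywords visible
# in the original regex are listed here; the rest of the list is elided.
MEAL_KEYWORDS = %w[restaurant meal wine].freeze

# Regexp.union escapes each string and joins them with "|"; the pattern
# is then rebuilt with the case-insensitive flag.
MEAL_PATTERN = Regexp.new(Regexp.union(MEAL_KEYWORDS).source, Regexp::IGNORECASE)

def meal?(text)
  text.match?(MEAL_PATTERN)
end

# Data-driven check: every keyword is tried once, in upper case, without
# keeping a second copy of the list. Returns the keywords that were missed.
def keywords_not_detected
  MEAL_KEYWORDS.reject { |keyword| meal?("Special #{keyword.upcase} offer") }
end
```

In a real test, an assertion that keywords_not_detected is empty would sit inside a test method of whatever framework is in use. Such a test mostly guards the wiring between the constant and the method rather than the choice of keywords.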
The second question:
I have an .html file in source control that is used to test the crawler’s parsing functionality. In theory, all of its items should pass, so I am thinking of reusing that file in this categorizing test: parse the HTML and feed the description of each deal into this method.
One drawback is that the .html is taken from an external site. When that site changes its layout, I will update this .html file, and then I will have to change this categorizing test too. But I think this is okay.
Is this recommended? I thought of this approach because I feel uneasy about extracting information from that .html file and placing it in the test script itself (it is not DRY, and it makes the test script quite big). Would feeding in the parsed descriptions violate any fundamental testing laws, like ‘this hides the necessary details away from developers’ or ‘this is bad for generating reports’?
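Roughly, the fixture-driven test I have in mind would look like the sketch below. The extract_descriptions helper, the "description" class name and the sample markup are all made up for illustration; they stand in for the crawler's real parser and the real .html file, and the keyword list is shortened to the three keywords shown earlier.

```ruby
# Stand-in for the crawler's real parser (hypothetical). It returns the
# text of every <p class="description"> element in the given HTML.
def extract_descriptions(html)
  html.scan(%r{<p class="description">(.*?)</p>}m).flatten.map(&:strip)
end

# Same logic as the method above, with the elided keywords left out.
def meal?(text)
  !text.match(/restaurant|meal|wine/i).nil?
end

# Invented markup. In the real test this string would come from
# File.read on the .html file that is kept in source control.
SAMPLE_HTML = <<~HTML
  <div class="deal"><p class="description">Three-course meal for two</p></div>
  <div class="deal"><p class="description">Wine tasting at a local restaurant</p></div>
HTML

# Returns the descriptions that meal? failed to recognise.
def misclassified_descriptions(html)
  extract_descriptions(html).reject { |text| meal?(text) }
end
```

The test would then assert that misclassified_descriptions returns an empty list. Returning the failing descriptions, rather than a bare true/false, keeps the failure message readable even though the inputs live in a separate file.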
OK, I obviously misunderstood the question, so I will revise this answer completely.
I personally think it is simpler and preferable to copy the actual text from the HTML file and paste it into the test, as opposed to adding the indirection of loading an HTML file. Two reasons I can find…
However, I cannot find a reason why what you are trying to do is really bad; I think it boils down to personal preference.
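For comparison, the copy/paste approach could look something like this minimal sketch. The sample strings are invented rather than taken from your .html file, the constant and helper names are hypothetical, and the regex only contains the three keywords visible in your question.

```ruby
# Same logic as the method in the question, with the elided keywords left out.
def meal?(text)
  !text.match(/restaurant|meal|wine/i).nil?
end

# Invented examples; in practice these would be pasted from the .html file.
MEAL_TEXTS = [
  "Three-course meal for two",
  "Free bottle of wine with dinner"
].freeze

# Texts that should NOT be classified as meal deals.
OTHER_TEXTS = [
  "Two nights in a sea-view room",
  "Half-price spa day"
].freeze

# Returns every sample that was classified incorrectly, in either direction.
def classification_failures
  missed = MEAL_TEXTS.reject { |text| meal?(text) }
  false_hits = OTHER_TEXTS.select { |text| meal?(text) }
  missed + false_hits
end
```

With the inputs written out next to the assertion, a failing test shows exactly which sentence was misclassified, and it is easy to add negative examples that the parsing fixture would never contain.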