I have often seen tests where canned inputs are fed into a program and the generated outputs are checked against canned (expected) outputs, usually via diff. If the diff is accepted, the code is deemed to pass the test.
Questions:
1) Is this an acceptable unit test?
2) Usually the unit test inputs are read in from the file system and are big
XML files (maybe they represent a very large system). Are unit tests supposed
to touch the file system? Or would a unit test create a small input on the fly
and feed that to the code to be tested?
3) How can one refactor existing code to be unit testable?
Output differences
If your requirement is to produce output with a certain degree of accuracy, then such tests are absolutely fine. You are the one who makes the final decision: “Is this output good enough, or not?”.
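As a rough illustration of such an output-comparison test, here is a minimal Python sketch. The function `render_report` and the expected text are hypothetical stand-ins for your real program and its canned output; only the standard `difflib` module is used to produce the diff.

```python
import difflib


def render_report(values):
    """Hypothetical code under test: formats one number per line."""
    return "".join(f"{v:.2f}\n" for v in values)


def diff_against_expected(actual, expected):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected",
        tofile="actual",
    ))


# Canned (expected) output, invented for this sketch.
EXPECTED = "1.00\n2.50\n"


def test_report_matches_expected():
    diff = diff_against_expected(render_report([1, 2.5]), EXPECTED)
    # On failure the diff itself becomes the assertion message.
    assert diff == [], "".join(diff)
```

Printing the diff on failure keeps the "is this output good enough?" decision visible to whoever reads the test result.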
Talking to the file system
You don’t want your tests to talk to the file system in the sense of relying on files that happen to exist somewhere on the machine (for example, reading values from configuration files). Test input resources are a bit different: you can usually embed them in your tests (or at least in the test project) and treat them as part of the codebase, and they should usually be loaded before the test executes. For example, when testing rather large XML documents it’s reasonable to store them as separate files rather than as strings in code files (which is sometimes a workable alternative).
The point is that you want to keep your tests isolated and repeatable. If you can achieve that with a file loaded at runtime, it’s probably fine. However, it’s still better to have such files as part of the codebase/resources than as ordinary files lying somewhere on the system.
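To make the two options concrete, here is a small Python sketch showing an input built on the fly and an input loaded from a resource directory. The names `count_nodes`, `load_resource` and `large_system.xml` are invented for illustration, and a temporary directory stands in for the resource folder that would normally live inside the test project.

```python
from pathlib import Path
import tempfile
import xml.etree.ElementTree as ET


def count_nodes(xml_text):
    """Hypothetical code under test: counts <node> elements in a document."""
    return len(ET.fromstring(xml_text).findall(".//node"))


def load_resource(resource_dir, name):
    """Read a test input from a directory that belongs to the test project."""
    return (Path(resource_dir) / name).read_text(encoding="utf-8")


def test_small_inline_input():
    # Small inputs can simply be created inside the test itself.
    assert count_nodes("<system><node/><node/></system>") == 2


def test_input_loaded_from_resource_dir():
    # In a real project resource_dir would be a folder checked in next to
    # the tests; a temporary directory is used here only so that the
    # sketch runs on its own.
    with tempfile.TemporaryDirectory() as resource_dir:
        (Path(resource_dir) / "large_system.xml").write_text(
            "<system><node/><group><node/></group></system>",
            encoding="utf-8",
        )
        xml_text = load_resource(resource_dir, "large_system.xml")
        assert count_nodes(xml_text) == 2
```

Either way the test does not depend on anything outside the project, which is what keeps it isolated and repeatable.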
Refactoring
This question is fairly broad, but to point you in the right direction: you want to introduce a more solid design, decouple objects and separate responsibilities. A better design will make testing easier and, most importantly, possible. Like I said, it’s a broad and complex topic, with entire books dedicated to it.
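One common first step, sketched below in Python under the assumption that your code currently opens its input file itself, is to separate the logic from the I/O so the logic can be fed any input. The function names are hypothetical and the example is deliberately tiny.

```python
import io


def sum_values(lines):
    """Pure logic: does not know or care where the lines come from."""
    return sum(int(line) for line in lines if line.strip())


def sum_values_from_stream(stream):
    """Thin I/O wrapper: accepts any file-like object, not a fixed path."""
    return sum_values(stream)


def test_sum_values_with_in_memory_lines():
    # Blank lines are skipped; no file is needed to test the logic.
    assert sum_values(["1\n", "2\n", "\n"]) == 3


def test_sum_values_from_stream():
    # io.StringIO stands in for a real file opened by production code.
    assert sum_values_from_stream(io.StringIO("4\n5\n")) == 9
```

Production code can still pass in a real open file, while the tests pass in-memory data; this is only one of many possible decoupling techniques.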