I’ve been playing with hard links lately in a program that copies files all over the place within its directory. It deletes all the duplicates and replaces them with hard links. I got that down okay. I understand that a hard link is just another reference to the same data on disk, so the data looks the same when I access it through a newly created link.
The problem is finding the real amount of disk space used, which I need in order to verify that this is indeed saving space. In other words, if one were to start with a 12K file, create a hard link to that file, then select both in Explorer, it would show 24K used on disk, not the 12K it really should be.
I know I can query the free space on the disk before and after the process and then compare. But that’s an initial assessment, which is hard to verify after the fact. I know as well that I can use GetFileInformationByHandle to find out whether the file in question has multiple references.
So any ideas here? Would I have to call GetFileInformationByHandle for each file, logging all that data, and then remove the files that have duplicate index references to get an accurate view of how much disk space is actually being used? Or is there an easier way to accomplish this?
Do exactly that. Maintain a set of (dwVolumeSerialNumber, nFileIndexHigh, nFileIndexLow) triples. Each time you encounter a file, check whether you’ve seen it before (i.e., whether its triple is already in your set). If so, then skip it. If not, then add its file size to your total and insert its information into the set.
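The set-based approach above can be sketched roughly as follows. This is a minimal illustration in Python rather than direct Win32 calls, on the assumption that `os.stat` is an acceptable stand-in: on Windows it reports the volume serial number as `st_dev` and the file index as `st_ino`, so the pair plays the role of the triple. The function name `unique_bytes` is made up for this example.

```python
import os
import stat

def unique_bytes(root):
    """Sum logical file sizes under root, counting each physical file once."""
    seen = set()   # (st_dev, st_ino) pairs that were already counted
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue              # unreadable, or removed during the walk
            if not stat.S_ISREG(st.st_mode):
                continue              # skip symlinks and other non-regular files
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue              # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total
```

Note that this sums logical file sizes only. It does not account for cluster rounding or alternate data streams, so it approximates the space on disk rather than measuring it exactly.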
This unfortunately means you’ll have to open every file. The link count isn’t kept in the directory information, so FindFirstFile can’t give it to you. You need GetFileInformationByHandle, and that requires a handle.
You might wish to read Raymond Chen’s article on the topic; it mentions several more corner cases besides hard links that could apply to your application, including reparse points, cluster rounding, and alternate data streams.
You could try to reduce the effort required in maintaining the set by only tracking files that have link counts greater than 1. Files with only one link shouldn’t appear multiple times in your directory traversal. That assumes you’ll see each directory exactly once. Reparse points such as junctions can make that assumption invalid, so if you try to reduce the size of your “files seen” set, you’ll also need to keep track of which directories you’ve already seen.
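A rough sketch of that reduced-set variant, again in Python as a stand-in for the Win32 calls. It assumes `st_nlink` reflects the link count (on Windows it corresponds to nNumberOfLinks) and that `os.stat` follows directory links so that a linked directory resolves to the same identity as its target. The name `unique_bytes_sparse` is hypothetical. Symbolic links to files are skipped, since following them would revisit a single-link file without going through the set.

```python
import os

def unique_bytes_sparse(root):
    """Like a full scan, but remember only multiply-linked files and directories."""
    seen_files = set()   # identities of files whose link count is > 1
    seen_dirs = set()    # identities of directories already traversed
    total = 0
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            dst = os.stat(directory)          # follows links to the real directory
            dkey = (dst.st_dev, dst.st_ino)
            if dkey in seen_dirs:
                continue                      # reached again through a link or loop
            seen_dirs.add(dkey)
            entries = list(os.scandir(directory))
        except OSError:
            continue
        for entry in entries:
            try:
                if entry.is_dir():            # also true for links to directories
                    stack.append(entry.path)
                    continue
                if entry.is_symlink() or not entry.is_file():
                    continue                  # ignore file symlinks and special files
                # DirEntry.stat() leaves st_ino/st_nlink at zero on Windows,
                # so a full os.stat() call is needed for each file.
                st = os.stat(entry.path)
            except OSError:
                continue
            if st.st_nlink > 1:
                fkey = (st.st_dev, st.st_ino)
                if fkey in seen_files:
                    continue                  # already counted through another link
                seen_files.add(fkey)
            total += st.st_size
    return total
```

The trade-off is that the file set stays small when hard links are rare, at the cost of one extra `os.stat` call and one set entry per directory.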