I have a highly structured, hierarchical directory containing multiple files that need to be moved into a flat structure and renamed at the same time. The original path and name must be logged along with the new path and name, and eventually loaded into a database. Each renamed file must also get a unique, unguessable (i.e. encrypted or hashed) file name. When the renamed files are moved into the new directory structure, I also want to limit the number of files in each directory: each directory would be created with a sequential number for its name, and files would be loaded into it until a maximum number of files was reached (e.g. 255) before rolling over into a new directory named with the next sequential number.
Is there a tool or piece of software that does this? I did some initial research and found nothing that meets the following criteria:
- batch rename & copy into alternative (flatter) structure
- hash / encrypt filename and ensure uniqueness
- sequentially name folders and limit file count
- log each file’s original name and path, and new (encrypted) name and path
I have several Bash scripts I have used in the past to migrate hand-made file repositories to hashed repositories to be accessed and managed from a web application (mostly PHP apps). In these repositories filenames are hashed (to avoid collisions with files with the same content/name) and files are distributed evenly (in a deterministic fashion or randomly) to keep files-per-dir count low for performance reasons. The following is one fully-working example:
Just run it from the root of your new repository. You can configure it by editing the variables at the top: MAXFILESPERDIR defines how many files to store per directory; TARGETROOTDIR is the name of the first-level directory to create (the script uses only two levels, and the first one is really a single root); and RANDOMDISTRIBUTION defines whether the files are distributed randomly (which may look uneven, especially for small runs) or deterministically (just counting).
How it works (FYI, just in case this is not what you are looking for but maybe you can get some ideas):
If you set RANDOMDISTRIBUTION to 1 and run the script several times, you’ll get duplicates of your source files, because each file gets a different target filename/path on every run. If RANDOMDISTRIBUTION is set to anything else, the files are renamed the same way every time you run the script, as long as the file set stays the same; if you add or remove files, they will get different names/paths.
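To illustrate why the deterministic mode is stable only for an unchanged file set, here is a minimal sketch of counting-based assignment. It is not the script itself: `assign_dirs` is a hypothetical helper, the sorted input order is an assumption, and MAXFILESPERDIR is set to 2 only to keep the demo short.

```bash
#!/usr/bin/env bash
# Sketch only: assign files to sequentially numbered directories by counting.
# MAXFILESPERDIR is the configuration name used above; 2 is a demo value.
MAXFILESPERDIR=2

# assign_dirs (hypothetical helper): reads one path per line on stdin and
# prints "path<TAB>directory-number". Sorting first makes the result
# independent of the order in which the paths arrive.
assign_dirs() {
    local counter=0 path
    LC_ALL=C sort | while IFS= read -r path; do
        printf '%s\t%d\n' "$path" $(( counter / MAXFILESPERDIR ))
        counter=$(( counter + 1 ))
    done
}

# Same file set, different input order: identical assignment.
printf '%s\n' b.txt a.txt c.txt | assign_dirs

# One extra file shifts every path that sorts after it.
printf '%s\n' a.txt aa.txt b.txt c.txt | assign_dirs
```

In the second call, `b.txt` lands in directory 1 instead of 0, which is the "different names/paths" effect described above.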
The point of using a random value + hash + counter is to make sure duplicates are handled (they won’t collide thanks to the counter) while still distributing the files randomly (for long enough runs, this distributes the files evenly).
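A rough sketch of the random value + hash + counter idea follows. How the original script combines the three parts is not shown here, so the layout below (hash of salt and path, then a counter suffix) is an assumption, as are the helper name `make_name` and the use of `sha256sum` from GNU coreutils.

```bash
#!/usr/bin/env bash
# Sketch only: build a file name from a random salt, a hash and a counter.
# make_name SOURCE_PATH COUNTER [SALT]   (hypothetical helper)
#   SALT defaults to Bash's $RANDOM; pass it explicitly for repeatable names.
make_name() {
    local src=$1 counter=$2 salt=${3:-$RANDOM}
    local hash
    # Hash the salt together with the original path (assumed combination).
    hash=$(printf '%s|%s' "$salt" "$src" | sha256sum | cut -d' ' -f1)
    # The counter suffix keeps names distinct even if two hashes were equal.
    printf '%s-%s\n' "$hash" "$counter"
}

# Two files with the same original path still get different names.
make_name "docs/report.pdf" 1
make_name "docs/report.pdf" 2
```

Note that `$RANDOM` only yields values from 0 to 32767, so if the names must be hard to guess, a stronger random source would be a better salt.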
Also, the prefix of the generated file name is the name of the directory too, so if you have the file name and the length of the directory name, you can calculate the directory name (just in case you don’t store it in your database table).
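For example, recovering the directory from a stored file name could look roughly like this. The helper names `dir_from_name` and `path_from_name`, the DIRNAMELENGTH variable, the TARGETROOTDIR value and the sample file name are all made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: derive the directory from the file-name prefix.
TARGETROOTDIR="repo"   # configuration name from above; value is an example
DIRNAMELENGTH=2        # hypothetical: length of each directory name

# dir_from_name NAME [LENGTH]: the first LENGTH characters of the name.
dir_from_name() {
    local name=$1 len=${2:-$DIRNAMELENGTH}
    printf '%s\n' "${name:0:$len}"
}

# path_from_name NAME: rebuild the full relative path of a stored file.
path_from_name() {
    local name=$1
    printf '%s/%s/%s\n' "$TARGETROOTDIR" "$(dir_from_name "$name")" "$name"
}

path_from_name "3fa9c2e1.pdf"
```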
Finally, this is a one-time migration script; it was not really written to be run regularly over the same set of files.