I’m using Mathematica to work with a large array of website files, which I’ve mirrored onto my own system. They are spread across several hundred directories, with tons of sub-directories. So for example, I have:
/users/me/test/directory1
/users/me/test/directory1/subdirectory2 [times a hundred]
/users/me/test/directory2
/users/me/test/directory2/subdirectory5 [etc. etc.]
What I need to do is to go into each directory, Import[] all the HTML files as Plaintext, and then put them in another directory elsewhere on my system named after ‘directory1’. So far, with Do[] loops I have been able to do a rough version: the best case I have right now, however, is dumping the “.txt” files in the original directory, which isn’t an ideal solution as they’re still spread all over my system.
To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];
Some additional vexing problems:
(1) Duplicates: Is there a way for Mathematica to deal with duplicates – i.e. if we run into another index_en.html can it be renamed as index_en_1.html?
(2) Directories: Because of all the directories, unless I use Mathematica to constantly SetDirectory and CreateDirectory over and over again, it keeps running into trouble.
This all seems a bit confusing. Basically, is there an efficient way for Mathematica to find a ton of HTML files spread across hundreds of directories/subdirectories, Import them as plaintext, and export them somewhere else? [It’s important for me to know they came from directory1, but that’s it.]
— edited for clarity below —
Here is the code that I currently have:
SetDirectory["/users/me/web/"];
(* keep only directories; FileNames[] also returns plain files *)
dirlist = Select[FileNames[], DirectoryQ];
directoryPrefix = "/users/me/web/";
Do[
  directory = directoryPrefix <> dirname;
  exportPrefix = "/users/me/desktop/bucket/";
  SetDirectory[directory];
  allFiles = FileNames["*.htm*", {"*"}, Infinity];
  (* collects the text of every HTML file found under this directory *)
  plainHTMLBucket = "";
  Do[
    plainHTML = Import[filename, "Plaintext"];
    (* strings are joined with <>; AppendTo only works on lists *)
    plainHTMLBucket = plainHTMLBucket <> plainHTML <> "\n";
  , {filename, allFiles}];
  Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
  Print["We Have Reached Here"];
, {dirname, dirlist}];
What’s wrong with it from my perspective? Besides being messy, it’s my workaround: I would much rather have all the files separated rather than one big one – i.e. take each import and export as a separate file, but in a directory called ‘directory1’ albeit somewhere else. The problem is when it comes to mirroring these directories (the directories don’t exist, but I am having trouble using CreateDirectory[] to dynamically do so).
My apologies for the confusion here; I know it shows in this question.
The following code might do the trick:
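The code block that originally followed this sentence is not preserved here. As a stand-in, here is a minimal sketch assembled from the steps described under "How It Works". The name htmlTreeToPlainText is hypothetical, and matching only "*.html" files is an assumption taken from the description of the extension change.

```mathematica
(* Helper: map file paths under `source` to the same relative
   locations under `target`. *)
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]},
    FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames
  ]

(* Hypothetical name. WARNING: deletes `target` if it already exists. *)
htmlTreeToPlainText[source_, target_] :=
  Module[{htmlFiles, textFiles, targetDirs},
    (* all HTML files at any depth below source *)
    htmlFiles = FileNames["*.html", source, Infinity];
    (* corresponding .txt paths below target *)
    textFiles = StringReplace[
      mapFileNames[source, htmlFiles, target],
      ".html" ~~ EndOfString -> ".txt"];
    (* directories that must exist before exporting *)
    targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles];
    If[DirectoryQ[target], DeleteDirectory[target, DeleteContents -> True]];
    Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, targetDirs];
    (* import each HTML file as plain text and write it to its target path *)
    MapThread[
      Export[#2, Import[#1, "Plaintext"], "Text"] &,
      {htmlFiles, textFiles}];
    textFiles
  ]
```

Because the directory structure is mirrored under the target, two files with the same name in different source directories end up in different target directories, so no renaming of duplicates should be needed.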
Example use (warning: the target directory will be deleted first!):
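The original example is not preserved either. A call would presumably look something like the following. It assumes that a function taking a source and a target directory, here given the hypothetical name htmlTreeToPlainText, has already been defined along the lines described under "How It Works". The two paths are borrowed from the question.

```mathematica
(* Paths taken from the question; adjust to your own system *)
source = "/users/me/test";
target = "/users/me/desktop/bucket";

(* htmlTreeToPlainText is a hypothetical name and must be defined first.
   WARNING: everything under `target` is deleted before the export. *)
htmlTreeToPlainText[source, target]
```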
How It Works
The various Mathematica FileName... functions are helpful in this context. First, we start by defining the helper function mapFileNames that takes a source directory, a list of file names that lie within the source directory, and a target directory. It returns a list of file paths that name the corresponding locations underneath the target directory.
The function uses FileNameDrop to drop the leading source path elements from each filename and FileNameJoin to prepend the target path onto the front of each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.
For example:
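The definition and example are missing at this point, so here is a sketch of what the description implies. The definition follows the three functions named in the text, and the sample paths are made up for illustration.

```mathematica
(* Drop the leading `source` elements from each path, then prepend `target` *)
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]},
    FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames
  ]

(* Hypothetical paths: both files should be mapped to the same
   relative locations underneath "/d" *)
mapFileNames["/a/b", {"/a/b/x.html", "/a/b/c/y.html"}, "/d"]
```

Note that the file names passed in need to begin with exactly the source path, which is the case when they come from FileNames called on that same source directory.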
Using this function, we can convert a list of HTML file paths under a source directory (source) into a corresponding list of text file paths under the target directory (target):
These statements retrieve the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. We can now extract the necessary directory names from the resulting text files:
Again FileNameDrop is used, this time to drop the filename portion from each text file’s path.
Next, we need to delete the target directory (if it already exists) and create the new required directories:
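The three steps just described (building the file lists, extracting the directory names, and recreating the target tree) can be sketched together as follows. The paths are taken from the question, the variable names htmlFiles, textFiles and targetDirs are assumptions, and mapFileNames is repeated so the snippet stands on its own.

```mathematica
(* Helper as described earlier in the text *)
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]},
    FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames
  ]

(* Paths from the question; adjust to your own system *)
source = "/users/me/test";
target = "/users/me/desktop/bucket";

(* Step 1: list the HTML files and derive the matching .txt paths *)
htmlFiles = FileNames["*.html", source, Infinity];
textFiles = StringReplace[
  mapFileNames[source, htmlFiles, target],
  ".html" ~~ EndOfString -> ".txt"];

(* Step 2: drop the file name from each path to get the directories needed *)
targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles];

(* Step 3: WARNING, this removes the whole target tree before recreating it *)
If[DirectoryQ[target], DeleteDirectory[target, DeleteContents -> True]];
Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, targetDirs];
```

Creating every directory up front means there is no need to call SetDirectory or CreateDirectory repeatedly while the files are being exported.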
We can now perform the HTML-to-text transformation, safe in the knowledge that the target directories already exist:
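A sketch of this last step, assuming htmlFiles and textFiles are equal-length lists of source and target paths. To keep the snippet self-contained, it builds a one-file demo in the temporary directory; those file names are made up.

```mathematica
(* Hypothetical one-file demo in the system temporary directory *)
htmlFiles = {FileNameJoin[{$TemporaryDirectory, "demo_source.html"}]};
textFiles = {FileNameJoin[{$TemporaryDirectory, "demo_target.txt"}]};
Export[First[htmlFiles], "<html><body><p>Hello</p></body></html>", "Text"];

(* Import each HTML file as plain text and export it to its paired path *)
MapThread[
  Export[#2, Import[#1, "Plaintext"], "Text"] &,
  {htmlFiles, textFiles}
];
```

MapThread pairs each source file with its target path, so the two lists have to stay in the same order.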