I’m using Mathematica to work with a large array of website files, which I’ve

Asked: May 26, 2026 (2026-05-26T20:52:35+00:00)

I’m using Mathematica to work with a large array of website files, which I’ve mirrored onto my own system. They are spread across several hundred directories, with tons of sub-directories. So for example, I have:

/users/me/test/directory1
/users/me/test/directory1/subdirectory2 [times a hundred]
/users/me/test/directory2
/users/me/test/directory2/subdirectory5 [etc. etc.]

What I need to do is to go into each directory, Import[] all the HTML files as Plaintext, and then put them in another directory elsewhere on my system named after ‘directory1’. So far, with Do[] loops I have been able to do a rough version: the best case I have right now, however, is dumping the “.txt” files in the original directory, which isn’t an ideal solution as they’re still spread all over my system.

To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];

Some additional vexing problems:

(1) Duplicates: Is there a way for Mathematica to deal with duplicates – i.e. if we run into another index_en.html can it be renamed as index_en_1.html?

(2) Directories: Because of all the directories, unless I use Mathematica to constantly SetDirectory and CreateDirectory over and over again, it keeps running into trouble.

This all seems a bit confusing. Basically, is there an efficient way for Mathematica to find a ton of HTML files spread across hundreds of directories/subdirectories, Import them as plaintext, and export them somewhere else [it’s important for me to know they came from directory1, but that’s it].

— edited for clarity below —

Here is the code that I currently have:

SetDirectory[
  "/users/me/web/"];
dirlist = FileNames[];
directoryPrefix = 
  "/users/me/web/";
plainHTMLBucket = "";
Do[
  directory = directoryPrefix <> dirname;
  exportPrefix = 
   "/users/me/desktop/bucket/";
  SetDirectory[directory];
  allFiles = FileNames["*.htm*", {"*"}, Infinity];
  plainHTMLBucket = "";
  Do[
   plainHTML = Import[filename, "Plaintext"];
   plainHTMLBucket = plainHTMLBucket <> plainHTML <> "\n";
   , {filename, allFiles}];
  Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
  Print["We Have Reached Here"];
  , {dirname, dirlist}];

What’s wrong with it from my perspective? Besides being messy, it’s my workaround: I would much rather have all the files separated rather than one big one – i.e. take each import and export as a separate file, but in a directory called ‘directory1’ albeit somewhere else. The problem is when it comes to mirroring these directories (the directories don’t exist, but I am having trouble using CreateDirectory[] to dynamically do so).

My apologies for the confusion here – I know it shows with this question..

1 Answer

Editorial Team · Answer 1 · 2026-05-26T20:52:36+00:00

The following code might do the trick:

mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]}
  , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
  ]

htmlTreeToPlainText[source_, target_] :=
  Module[{htmlFiles, textFiles, targetDirs}
  , htmlFiles = FileNames["*.html", source, Infinity]
  ; textFiles = StringReplace[
                  mapFileNames[source, htmlFiles, target]
                  , f__~~".html"~~EndOfString :> f~~".txt"
                  ]
  ; targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
  ; If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
  ; Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
  ; Scan[
      Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
    , Transpose[{htmlFiles, textFiles}]
    ]
  ]

Example use (warning: the target directory will be deleted first!):

htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]

How It Works

The various Mathematica FileName... functions are helpful in this context. First, we start by defining the helper function mapFileNames that takes a source directory, a list of file names that lie within the source directory, and a target directory. It returns a list of file paths that name the corresponding locations underneath the target directory.

mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]}
  , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
  ]

The function uses FileNameDrop to drop the leading source path elements from each filename and FileNameJoin to prepend the target path onto the front of each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.

For example:

In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
Out[83]= {"/d/x.txt", "/d/c/y.txt"}

Using this function, we can convert a list of HTML file paths under a source directory (source) into a corresponding list of text file paths under the target directory (target):

htmlFiles = FileNames["*.html", source, Infinity]

textFiles = StringReplace[
              mapFileNames[source, htmlFiles, target]
              , f__~~".html"~~EndOfString :> f~~".txt"
              ]

These statements retrieve the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. We can now extract the necessary directory names from the resulting text files:
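Note that the question searches for "*.htm*", while the statements above match only "*.html". If .htm files should be converted too, a minimal sketch might look like the following. The source and target values are just the example paths from above, the mapping is inlined so that the snippet stands alone, and the regular expression is my own substitution for the string pattern used in the answer:

```mathematica
(* Assumption: files ending in .htm should be converted as well as .html *)
source = "/users/me/web";
target = "/users/me/desktop/bucket";

(* FileNames accepts a list of patterns *)
htmlFiles = FileNames[{"*.htm", "*.html"}, source, Infinity];

(* Same mapping as mapFileNames, inlined so the snippet stands alone *)
mapped = FileNameJoin[{target, FileNameDrop[#, FileNameDepth[source]]}] & /@ htmlFiles;

(* Swap a trailing ".htm" or ".html" for ".txt" *)
textFiles = StringReplace[mapped, RegularExpression["\\.html?$"] -> ".txt"];
```

One caveat: if a directory contains both page.htm and page.html, both map to the same page.txt, and the later export overwrites the earlier one.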

targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]

Again, FileNameDrop is used, this time with a negative count to drop the trailing filename portion from each text file’s path.
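To illustrate this step with made-up paths (the list below is a hypothetical stand-in for the real textFiles):

```mathematica
(* Hypothetical sample paths, standing in for the mapped text files *)
textFiles = {"/d/x.txt", "/d/c/y.txt", "/d/c/z.txt"};

(* Drop the last path element (the file name), then remove repeated directories *)
targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]
(* on a Unix-like system: {"/d", "/d/c"} *)
```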

Next, we need to delete the target directory (if it already exists) and create the new required directories:

If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]

Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]

We can now perform the HTML-to-text transformation, safe in the knowledge that the target directories already exist:

Scan[
  Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
, Transpose[{htmlFiles, textFiles}]
]
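On the duplicates question: because the code above mirrors the directory tree, two files named index_en.html in different sub-directories end up in different target directories and do not clash. Renaming would only matter if the output were flattened into a single directory per top-level folder. In that case, a helper along these lines might be one way to do it. This is a rough sketch: renameDuplicates is a hypothetical name, not part of the answer above, and it assumes Mathematica 10 or later (for associations) and paths that have a file extension:

```mathematica
(* Hypothetical helper: the second and later occurrences of a path
   get _1, _2, ... inserted before the extension. *)
renameDuplicates[paths_List] :=
  Module[{seen = <||>, n},
    Function[path,
      n = Lookup[seen, path, 0];   (* how often this path has been seen so far *)
      seen[path] = n + 1;
      If[n == 0,
        path,
        DirectoryName[path] <> FileBaseName[path] <> "_" <> ToString[n] <>
          "." <> FileExtension[path]
      ]
    ] /@ paths
  ]

renameDuplicates[{"/d/index_en.txt", "/d/index_en.txt", "/d/about.txt"}]
(* on a Unix-like system: {"/d/index_en.txt", "/d/index_en_1.txt", "/d/about.txt"} *)
```

The sketch does not check whether a generated name such as index_en_1.txt already appears elsewhere in the list, so that case would still need handling.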
