Editorial Team
Asked: May 26, 2026

I’m using Mathematica to work with a large array of website files, which I’ve mirrored onto my own system. They are spread across several hundred directories, with tons of sub-directories. So for example, I have:

/users/me/test/directory1
/users/me/test/directory1/subdirectory2 [times a hundred]
/users/me/test/directory2
/users/me/test/directory2/subdirectory5 [etc. etc.]

What I need to do is to go into each directory, Import[] all the HTML files as Plaintext, and then put them in another directory elsewhere on my system named after ‘directory1’. So far, with Do[] loops I have been able to do a rough version: the best case I have right now, however, is dumping the “.txt” files in the original directory, which isn’t an ideal solution as they’re still spread all over my system.

To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];

Some additional vexing problems:

(1) Duplicates: Is there a way for Mathematica to deal with duplicates – i.e. if we run into another index_en.html can it be renamed as index_en_1.html?

(2) Directories: Because of all the directories, unless I use Mathematica to constantly SetDirectory and CreateDirectory over and over again, it keeps running into trouble.

This all seems a bit confusing. Basically, is there an efficient way for Mathematica to find a ton of HTML files spread across hundreds of directories/subdirectories, Import them as plaintext, and export them somewhere else [it’s important for me to know they came from directory1, but that’s it].

— edited for clarity below —

Here is the code that I currently have:

SetDirectory["/users/me/web/"];
dirlist = FileNames[];
directoryPrefix = "/users/me/web/";
exportPrefix = "/users/me/desktop/bucket/";
Do[
  directory = directoryPrefix <> dirname;
  SetDirectory[directory];
  allFiles = FileNames["*.htm*", {"*"}, Infinity];
  (* collect the plain text of every file under this directory in one string *)
  plainHTMLBucket = "";
  Do[
   plainHTML = Import[filename, "Plaintext"];
   (* AppendTo does not work on a string, so join the strings instead *)
   plainHTMLBucket = plainHTMLBucket <> plainHTML <> "\n";
   , {filename, allFiles}];
  Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
  Print["We Have Reached Here"];
  , {dirname, dirlist}];

What’s wrong with it from my perspective? Besides being messy, it’s only a workaround. I would much rather keep the files separate than merge them into one big file: that is, import each file and export it as its own text file, in a directory called ‘directory1’ located somewhere else. The problem comes when mirroring these directories: the target directories don’t exist yet, and I am having trouble using CreateDirectory[] to create them dynamically.

My apologies for the confusion here; I know it shows in this question.

1 Answer

    Editorial Team
    Added an answer on May 26, 2026 at 8:52 pm

    The following code might do the trick:

    mapFileNames[source_, filenames_, target_] :=
      Module[{depth = FileNameDepth[source]}
      , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
      ]
    
    htmlTreeToPlainText[source_, target_] :=
      Module[{htmlFiles, textFiles, targetDirs}
      , htmlFiles = FileNames["*.html", source, Infinity]
      ; textFiles = StringReplace[
                      mapFileNames[source, htmlFiles, target]
                      , f__~~".html"~~EndOfString :> f~~".txt"
                      ]
      ; targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
      ; If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
      ; Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
      ; Scan[
          Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
        , Transpose[{htmlFiles, textFiles}]
        ]
      ]
    

    Example use (warning: the target directory will be deleted first!):

    htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]
    

    How It Works

    The various Mathematica FileName... functions are helpful in this context. We start by defining the helper function mapFileNames, which takes a source directory, a list of file names that lie within the source directory, and a target directory. It returns a list of file paths that name the corresponding locations underneath the target directory.

    mapFileNames[source_, filenames_, target_] :=
      Module[{depth = FileNameDepth[source]}
      , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
      ]
    

    The function uses FileNameDrop to drop the leading source path elements from each filename and FileNameJoin to prepend the target path onto the front of each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.

    For example:

    In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
    Out[83]= {"/d/x.txt", "/d/c/y.txt"}
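
To see what each call contributes, the helper can be unrolled for a single path. This is only an illustrative sketch: depth, relative, and mapped are names introduced here, the paths are the hypothetical ones from the example above, and a Unix-style path separator is assumed.

```mathematica
(* Hypothetical sample values, reused from the example above *)
source = "/a/b";
target = "/d";
file = "/a/b/c/y.txt";

(* how many leading path elements belong to the source directory *)
depth = FileNameDepth[source];

(* strip those elements, leaving the path relative to the source *)
relative = FileNameDrop[file, depth];

(* re-root the relative path underneath the target directory *)
mapped = FileNameJoin[{target, relative}]
```

The final value should agree with the second entry of Out[83], since mapFileNames applies these same three steps to every file name in the list.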
    

    Using this function, we can convert a list of HTML file paths under a source directory (source) into a corresponding list of text file paths under the target directory (target):

    htmlFiles = FileNames["*.html", source, Infinity]
    
    textFiles = StringReplace[
                  mapFileNames[source, htmlFiles, target]
                  , f__~~".html"~~EndOfString :> f~~".txt"
                  ]
    

    These statements retrieve the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. We can now extract the necessary directory names from the resulting text files:

    targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
    

    Again FileNameDrop is used, this time to drop the filename portion from each text file’s path.
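
As a small illustration of this step, assume a hypothetical list of three mapped text files, two of which share a directory. The paths below are made up for the sketch and are not taken from the question.

```mathematica
(* Hypothetical mapped paths; two of them live in the same directory *)
textFiles = {"/d/x.txt", "/d/c/y.txt", "/d/c/z.txt"};

(* drop the last path element (the file name) and remove repeated directories *)
targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]
```

Because DeleteDuplicates collapses the repeated directory, the result should contain two entries rather than three, so each directory is created only once.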

    Next, we need to delete the target directory (if it already exists) and create the new required directories:

    If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
    
    Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
    

    We can now perform the HTML-to-text transformation, safe in the knowledge that the target directories already exist:

    Scan[
      Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
    , Transpose[{htmlFiles, textFiles}]
    ]
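
If deleting the target directory is too drastic, the same steps can be rearranged so that nothing is removed. The following is a sketch under stated assumptions, not part of the original answer: htmlTreeToPlainTextKeep is a hypothetical name, source is assumed to be an absolute path, and the pattern list is widened to cover the .htm files that the question's "*.htm*" search would also find. It relies only on the built-in functions already used above plus DirectoryQ and MapThread.

```mathematica
(* Sketch only: htmlTreeToPlainTextKeep is a hypothetical name, not from the answer *)
htmlTreeToPlainTextKeep[source_, target_] :=
  Module[{depth, htmlFiles, textFiles, targetDirs},
    depth = FileNameDepth[source];
    (* find both .htm and .html files anywhere below the source directory *)
    htmlFiles = FileNames[{"*.htm", "*.html"}, source, Infinity];
    (* re-root each path under target and swap the extension for .txt;
       note that a.htm and a.html in one directory would both map to a.txt *)
    textFiles = StringReplace[
      FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ htmlFiles,
      (".html" | ".htm") ~~ EndOfString -> ".txt"];
    targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles];
    (* create only the directories that are missing; nothing is deleted *)
    Scan[
      If[! DirectoryQ[#],
        CreateDirectory[#, CreateIntermediateDirectories -> True]] &,
      targetDirs];
    (* convert each file; an existing .txt file at the same path is overwritten *)
    MapThread[
      Export[#2, Import[#1, "Plaintext"], "Text"] &,
      {htmlFiles, textFiles}]
  ]
```

It would be called the same way as the original, for example htmlTreeToPlainTextKeep["/users/me/web", "/users/me/desktop/bucket"], and returns the list of text files it wrote. Text files left over from an earlier run are kept unless a new file lands on the same path.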
    