The Archive Base

Editorial Team
Asked: May 11, 2026 (2026-05-11T17:12:06+00:00)

I have a large number of text files (1000+) each containing an article from an academic journal. Unfortunately each article’s file also contains a “stub” from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).

I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs constitute duplicate data.

There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to be formatted the same and on the same line in both cases.

A script that compared each file to the next file and then removed 1 copy of the duplicate text would be perfect. This seems like it would be a pretty common issue when programming so I am surprised that I haven’t been able to find anything that does this.

The file names sort in order, so a script that compares each file to the next sequentially should work. For example:

bul_9_5_181.txt
bul_9_5_186.txt

are two articles, one starting on page 181 and the other on page 186. Both of these articles are included below.

There are two volumes of test data located at http://drop.io/fdsayre

Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years of experience with Linux and can usually figure things out as I go.

Thanks for your help

FILENAME: bul_9_5_181.txt

SYN&STHESIA

ISI

the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition.
REFERENCES

DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.

AFFECTIVE PHENOMENA — EXPERIMENTAL
BY PROFESSOR JOHN F. .SHEPARD
University of Michigan

Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of

182

JOHN F. SHEPARD

breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows

1 Answer

  1. Editorial Team
    Added an answer on May 11, 2026 at 5:12 pm (2026-05-11T17:12:07+00:00)

    Here is the beginning of another possible solution in Perl. (It works as is but could probably be made more sophisticated if needed.) It sounds as if all you are concerned about is removing duplicates across the corpus, and you don’t really care if the last part of one article is in the file for the next one as long as it isn’t duplicated anywhere. If so, this solution will strip the duplicated stub from the end of each file, leaving only one copy of that text in the set of files as a whole.

    You can either run the script in the directory containing the text files with no argument, or specify a file containing the list of files you want to process, one per line, in the order you want them processed. I recommend the latter, as your file names (at least in the sample files you provided) do not naturally list out in order when using simple commands like ls on the command line or glob in the Perl script. Without a list, the script won’t necessarily compare the correct files to one another, as it just runs down the list generated by glob. If you specify the list, you can guarantee that the files will be processed in the correct order, and it doesn’t take long to set up.
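    One way to build that list is to sort on the numeric parts of the names rather than as plain strings. The sketch below assumes every file follows the bul_<volume>_<issue>_<page>.txt pattern seen in the sample names; the sub name sort_by_page is made up for illustration, and names that do not fit the pattern are silently dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort names of the ASSUMED form bul_<volume>_<issue>_<page>.txt by their
# numeric fields, so that page 99 comes before page 181.
# Names that do not match the pattern are left out of the result.
sub sort_by_page {
    my @names = @_;
    my @sorted =
        map  { $_->[0] }
        sort {
            $a->[1] <=> $b->[1]      # volume
                or $a->[2] <=> $b->[2]   # issue
                or $a->[3] <=> $b->[3]   # starting page
        }
        map  { [ $_, /^bul_(\d+)_(\d+)_(\d+)\.txt$/ ] }
        grep { /^bul_\d+_\d+_\d+\.txt$/ } @names;
    return @sorted;
}

# When run directly, print the ordered list, one name per line.
unless (caller) {
    print "$_\n" for sort_by_page(glob "bul_*.txt");
}
```

    Redirecting the output into a file gives a list that can be passed to the script as its argument. It is worth glancing over the list before using it, since a file with an unexpected name would be missing from it.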

    The script opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + ‘.new’) for the first file and writes out the lines from the first file until it reaches the first three lines of the second file. There is an off chance that those three lines do not all appear in the first file, but in all the files I spot-checked they did, because of the journal name header and page numbers. One line definitely wasn’t enough, as the journal title was often the first line and that would cut things off early.

    I should also note that the last file in your list will not be processed (i.e. no new file will be created from it), as it is not changed by this process.

    Here’s the script:

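    #!/usr/bin/perl
    use strict;
    use warnings;
    
    # Build the list of files: read it from the file named in the first
    # argument (one name per line), or fall back to globbing the directory.
    my @files;
    if (@ARGV > 0) {
        open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
        chomp(@files = <$in>);              # strip newlines from the names
        close($in);
        @files = grep { /\S/ } @files;      # ignore blank lines
    } else {
        @files = glob "bul_*.txt";
    }
    my $count = @files;
    print "Processing $count files.\n";
    
    my $lastFile = "";
    foreach my $file (@files) {
        if ($lastFile ne "") {
            print "Processing $file\n";
            open(my $fb, '<', $file) or die "Cannot open $file: $!";
            my @fileBLines = <$fb>;
            close($fb);
            if (@fileBLines < 3) {
                warn "Skipping $lastFile: $file has fewer than three lines\n";
                $lastFile = $file;
                next;
            }
            # Escape regex metacharacters (parentheses etc.) in the first line.
            my $line0 = quotemeta $fileBLines[0];
            my $line1 = $fileBLines[1];
            my $line2 = $fileBLines[2];
            open(my $fa, '<', $lastFile) or die "Cannot open $lastFile: $!";
            my @fileALines = <$fa>;
            close($fa);
            my $newName = "$lastFile.new";
            open(my $out, '>', $newName) or die "Cannot write $newName: $!";
            # Copy lines from the previous file until the first three lines
            # of the current file appear; everything from there on is the stub.
            my $i = 0;
            my $done = 0;
            while (!$done and $i < @fileALines) {
                if ($i + 2 < @fileALines
                    && $fileALines[$i] =~ /$line0/
                    && $fileALines[$i+1] eq $line1      # string comparison, not ==
                    && $fileALines[$i+2] eq $line2) {
                    $done = 1;
                } else {
                    print $out $fileALines[$i];
                    $i++;
                }
            }
            close($out);
        }
        $lastFile = $file;
    }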
    

    EDIT: Added escaping for parentheses in the first line, which goes into the regex check for duplicates later on, so that they don’t break the match.
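    To try the matching step on a small sample before running it over the whole corpus, it may help to pull that step out into a sub that works on arrays of lines. This is only a sketch: the names find_stub_start and strip_trailing_stub are made up for illustration, and unlike the script it compares every line exactly with eq instead of matching the first line as a pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the index in @$prev_lines at which the first $n lines of
# @$next_lines appear consecutively, or -1 if they never do.
# All lines are compared exactly (an assumption; the script treats
# the first line as a regex pattern instead).
sub find_stub_start {
    my ($prev_lines, $next_lines, $n) = @_;
    return -1 if $n < 1 or @$next_lines < $n;
    for my $i (0 .. @$prev_lines - $n) {
        my $matched = 0;
        for my $j (0 .. $n - 1) {
            last if $prev_lines->[$i + $j] ne $next_lines->[$j];
            $matched++;
        }
        return $i if $matched == $n;
    }
    return -1;
}

# Return the lines of the previous file with its trailing stub removed.
# If no match is found, the lines come back unchanged.
sub strip_trailing_stub {
    my ($prev_lines, $next_lines, $n) = @_;
    my $start = find_stub_start($prev_lines, $next_lines, $n);
    return $start < 0 ? @$prev_lines : @{$prev_lines}[0 .. $start - 1];
}
```

    Calling strip_trailing_stub(\@prevLines, \@nextLines, 3) on the lines of two neighbouring files mirrors the three-line check used in the script. A return value of -1 from find_stub_start flags a pair of files where the overlap was not found and which may need a look by hand.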
