I have many CSV files that I need to clean (replace punctuation with spaces, replace certain words with others, …). Each CSV file has two columns, and in each column I replace some characters with others. For example, in the first column I replace ; with xxx, and in the second column I replace ; with ppp. To do that, I have two Perl scripts that use regexes. I split each CSV file into two files (file 1 = first column, file 2 = second column), then run the first-column script on file 1 and the second-column script on file 2.
That's not a good way to do it at all :s
So how can I have a single script in which the first set of rules runs on the first column and the second set on the second column of the SAME file?
CSV example:
http://dbpedia.org/resource/Berenguer_de_Cru%C3%AFlles Berenguer de Cruïlles
http://dbpedia.org/resource/Berenguer_de_Cru%C3%AFlles Berenguer de Cruïlles
The IRI is the first column and the names are in the second one.
Perl code (regex substitutions) for the first column:
use strict;
use warnings;
open(IN,$ARGV[0]);
open(OUT,">RES_xxx.txt");
while(my $l = <IN>)
{
chomp($l);
$l =~ s/http:\/\//_/g;
$l =~ s/,/vvv/g;
$l =~ s/"/=/g;
$l =~ s/'/#/g;
$l =~ s/\(/ééé/g;
$l =~ s/\)/èèè/g;
$l =~ s/%/zzz/g;
print OUT "$l\n";
}
close(IN);
close(OUT);
Perl code (regex substitutions) for the second column:
#!/usr/bin/perl
use strict;
use warnings;
open(IN,$ARGV[0]);
open(OUT,">RES_xxx.txt");
while(my $l = <IN>)
{
chomp($l);
$l =~ s/\(.+\)/ /g;
$l =~ s/'/ /g;
$l =~ s/"/ /g;
$l =~ s/,/ /g;
$l =~ s/\./ /g;
$l =~ s/:/ /g;
$l =~ s/;/ /g;
$l =~ s/!/ /g;
$l =~ s/\?/ /g;
$l =~ s/-/ /g;
$l =~ s/_/ /g;
$l =~ s/{/ /g;
$l =~ s/}/ /g;
$l =~ s/\+/ /g;
$l =~ s/=/ /g;
print OUT "$l\n";
}
close(IN);
close(OUT);
Thank you!
You could do that by parsing your file in two steps:
1. on the first step, you replace the ; in the first column of the original file;
2. on the second step, you replace the ; in the second column of the output of the first step.
This should be easy to do starting from your current solution: I suppose you have one regex that matches the first column and one that matches the second. You can simply change those regexes so that, instead of extracting the first or second column, they replace within that column.
If you provide more details about your files and how you currently split the two columns, I can provide a concrete example.
EDIT:
Since it seems that you only have two columns and neither of them contains any commas, you could do it like this:
1. parse the file line by line;
2. split the line at the , (the separator between the columns);
3. on each part you got at step 2, apply the regexes to replace what you want.
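The three steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the column separator is taken to be a comma, the sub names clean_first, clean_second and clean_line are made up for illustration, the substitution rules are a small subset of the ones in the question, and the sample line is hypothetical.

```perl
use strict;
use warnings;

# Rules for the first column (the IRI); a subset of the question's rules.
sub clean_first {
    my ($col) = @_;
    $col =~ s{http://}{_}g;
    $col =~ s/%/zzz/g;
    return $col;
}

# Rules for the second column (the name): punctuation becomes a space.
sub clean_second {
    my ($col) = @_;
    $col =~ s/\(.+\)/ /g;
    $col =~ s/['".:;!?_{}+=-]/ /g;
    return $col;
}

# Split one line at the first comma, clean each part, then join them again.
sub clean_line {
    my ($line) = @_;
    my ($first, $second) = split /,/, $line, 2;
    return $line unless defined $second;    # no separator: leave the line as-is
    return clean_first($first) . ',' . clean_second($second);
}

# Hypothetical sample line standing in for the real file.
my @sample = ('http://example.org/a%20b,A-B;C');
print clean_line($_), "\n" for @sample;    # prints "_example.org/azzz20b,A B C"
```

To run this on a real file, the loop over @sample would be replaced by the while loop from the question, calling clean_line($l) after chomp and printing the result to OUT. Splitting with a limit of 2 means only the first comma is treated as the separator, so any later commas stay in the second column.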
E.g.: