I am trying to tag terms using a list of annotations. Specifically, if the Perl regex finds a term from my list in a sentence, it should wrap that term in tags.
For example:
This drug has adverse effect on Lymphocytes, Lymphnodes, Lymph and pre-lymphocytes.
My list has the word Lymph. I am trying the following script:
open IN, "clean_cells.txt" or die "import file absent";
@array=<IN>;
foreach $words(@array)
{
    @cells=split/\t/,$words;
    $value=$cells[0];
    $replace=$cells[1];
    foreach my $fp (glob("$Directory/*.txt"))
    {
        @id=split('/',$fp);
        $id[1]=~s/.txt//ig;
        $Pub=$id[1];
        open FILE, "<",$fp or die "Can't open $fp: $!";
        open OUT, ">C:\\Users\\Desktop\\TM\\Files\\$Pub" or die "Check output status";
        while(<FILE>)
        {
            chomp $_;
            $line=$_;
            s/\b[\w\-]*$value[\w\-]*\b/<$replace>$&<\\$replace>/gi;
            # $string[$i]=$line;
            # while(($string[$i]=~m/\Q$value\E/i)|| ($string[$i]=~m/\Q$value(\w+)\E/i)||($string[$i]=~m/\Q(\w+)$value\E/i))
            # # if ($string[$i] =~ m/\b\w*$value\w*\b/i)
            # {
            # $value=~s/$value/<$replace>$value<\$replace>/i;
            # }
            print OUT "$line\n";
        }
        last;
    }
    last;
}
I am hoping the final sentence will look like this:
This drug has adverse effect on tag Lymphocytes tag, tag Lymphnodes tag, tag Lymph tag and tag pre-lymphocytes tag.
tag: represents $replace in the above script.
The program tags only the base word Lymph, not the entire terms such as Lymphocytes and pre-lymphocytes.
You need to keep your words together. The tricky part with that is determining what characters can make up words. A simpler approach (but perhaps not as exact) is to determine what makes up the delimiters. For example, you can use \S+ to match consecutive non-whitespace characters.
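Note that splitting on a capturing pattern such as /(\S+)/ is a non-destructive method, because the parens in the split regex make split return the matched parts as well as the text between them, so all the parts of the string are kept.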
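A minimal sketch of that idea, not necessarily the code originally posted here: it splits the sentence on a capturing \S+ and wraps every chunk that contains the term. The helper name tag_chunks, the literal tag name and the `</tag>` closing form are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every whitespace-delimited chunk that
# contains $term (case-insensitively) in <$tag>...</$tag>.
sub tag_chunks {
    my ($line, $term, $tag) = @_;
    # The capturing parens make split return the \S+ chunks as well as
    # the whitespace between them, so nothing from the line is lost.
    my @parts = split /(\S+)/, $line;
    for my $part (@parts) {
        # \Q...\E quotes the term so it is matched literally.
        $part = "<$tag>$part</$tag>" if $part =~ /\Q$term\E/i;
    }
    return join '', @parts;
}

my $sentence = 'This drug has adverse effect on Lymphocytes, Lymphnodes, Lymph and pre-lymphocytes.';
print tag_chunks($sentence, 'Lymph', 'tag'), "\n";
```

For the example sentence this prints the line with Lymphocytes, Lymphnodes, Lymph and pre-lymphocytes each wrapped in `<tag>...</tag>`, with the original spacing untouched.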
This simplistic code will preserve your whitespace, though as you can see, it will put commas and other such separator characters inside your tags. This can be fixed by using another character class, such as [^\s,.!?]+ (not whitespace, comma, period, exclamation point or question mark).
If you replace <DATA> with <>, you can use this script with redirection and skip the code about opening input and output files. I would personally prefer such functionality, rather than hard-coded file paths, and it is often the way *nix programs work.
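As a rough sketch of the stricter character class, the version below keeps punctuation outside the tags and takes several terms at once. The helper name tag_words and the term-to-tag hash are assumptions; the hash stands in for the two tab-separated columns of the question's list file.

```perl
use strict;
use warnings;

# Hypothetical helper: tag every word that contains one of the terms.
# $tags is a hashref mapping a term to the tag name to wrap it in.
sub tag_words {
    my ($line, $tags) = @_;
    # Words are runs of anything except whitespace , . ! ?
    # so separators end up between the captured words, outside the tags.
    my @parts = split /([^\s,.!?]+)/, $line;
    for my $part (@parts) {
        for my $term (sort keys %$tags) {
            if ($part =~ /\Q$term\E/i) {
                my $tag = $tags->{$term};
                $part = "<$tag>$part</$tag>";
                last;    # tag each word at most once
            }
        }
    }
    return join '', @parts;
}

print tag_words('Lymphocytes, Lymphnodes, Lymph and pre-lymphocytes.',
                { Lymph => 'tag' }), "\n";
```

To use it with redirection, the final print could be replaced by a loop over <>, for example: print tag_words($_, { Lymph => 'tag' }) while <>; and the script (here hypothetically named tag.pl) run as perl tag.pl < input.txt > output.txt.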