I should explain as background to this question that I don’t know any Perl and have a violent allergy to regular expressions (we all have our weaknesses). I’m trying to figure out why a Perl program won’t accept the data I’m feeding it. I don’t need to understand this program in any depth – I’m just doing a timing comparison.
Consider this assignment statement:
($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;
If I understand this correctly, it is checking if sample_ls_id matches some regex, and if so, assigning the entire string, or something like that.
However, I don’t understand how this works.
According to the documentation (perldoc perlretut, which I only looked at briefly), the expression
$sample_ls_id =~ /:\w\w(\d+):/
just returns true or false depending on whether there is a match.
The strings I’m trying to match look like
1000 10 0 0 1 urn:lsid:dcc.hapmap.org:Individual:CEPH1000.10:1 urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1
This fails with the error
Use of uninitialized value $sample_ls_id in concatenation (.) or string
at database/populate/family.pl line 38, <INPUT> line 1.
Line 38 is
print OUTPUT "$sample_ls_id\t$family_ped_id\t$individual_ped_id\t$father_ped_id\t$mother_ped_id\t$sex\t$created_by\t$population_code\n";
See the complete script below. However, the apparently very similar string
1420 9 0 0 1 urn:lsid:dcc.hapmap.org:Individual:CEPH1420.09:1 urn:lsid:dcc.hapmap.org:Sample:NA12003:1
seems to pass.
For context, the entire piece of code is:
use strict;
use warnings;
use Getopt::Long;
my $input_file = "data/family_ceu.txt";
my $output_file = "sql/family_ceu.sql";
my $population_code = "CEU";
GetOptions ('i=s' => \$input_file,
'o=s' => \$output_file,
'p=s' => \$population_code
);
usagecheck();
my $created_by = 'gwas_analyzer';
print "Creating SQL file for inserting family data from $input_file\n";
open (INPUT, "< $input_file");
open (OUTPUT, "> $output_file");
print OUTPUT "INSERT INTO population (population_code, private) VALUES ('$population_code', 'f');\n";
print OUTPUT "COPY family (ls_id, family_ped_id, individual_ped_id, father_ped_id, mother_ped_id, sex, created_by, population_code) FROM stdin;
";
while (my $line = <INPUT>)
{
chomp $line;
#Skip any comment lines
next if($line =~ /^#/);
my ($family_ped_id, $individual_ped_id, $father_ped_id, $mother_ped_id, $sex, $individual_ls_id, $sample_ls_id) = split (/\t/, $line);
($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;
print OUTPUT "$sample_ls_id\t$family_ped_id\t$individual_ped_id\t$father_ped_id\t$mother_ped_id\t$sex\t$created_by\t$population_code\n";
}
print OUTPUT "\\.\n";
close OUTPUT;
sub usagecheck
{
if (!$input_file || !$output_file || !$population_code)
{
print "Missing argument (see required arguments below):\n";
usage();
exit;
}
}
sub usage
{
print "perl family.pl -i <input file> -o <output file> -p <population code>\n";
}
I’m sure this is a very simple question if you know regexes and Perl.
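When
$sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1';
the regular expression ‘/:\w\w(\d+):/’ fails. It matches only when the string contains a colon ‘:’ followed by a “word” character ‘\w’, another “word” character ‘\w’, one or more digits ‘\d+’ and then a colon ‘:’. In ‘:SAMPLE1:’ the two word characters are ‘SA’, but the next character, ‘M’, is not a digit, and no other colon in the string is followed by two word characters and then digits.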
When
$sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';
the regular expression ‘/:\w\w(\d+):/’ finds its match in ‘:NA12003:’ (colon, two word characters, digits and a colon).
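The parentheses in ‘( $sample_ls_id )’ put the assignment in list context, where the match operator returns the list of captured groups rather than just true or false. So $sample_ls_id receives the ‘(\d+)’ portion of the match (also stored in $1), which in this case would be 12003.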
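To make the context difference concrete, here is a minimal sketch using the two identifiers from the question. The variable names ($good, $bad, $matched, $id_ok, $id_bad) are made up for illustration and do not appear in the original script.

```perl
use strict;
use warnings;

my $good = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';
my $bad  = 'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1';
my $re   = qr/:\w\w(\d+):/;

# Scalar context: the match operator only reports success or failure.
my $matched = $good =~ $re;
print "scalar context: ", ($matched ? "true" : "false"), "\n";   # true

# List context (parentheses on the left-hand side): the captures are returned.
my ($id_ok) = $good =~ $re;
print "list context: $id_ok\n";                                   # 12003

# A failed match in list context returns an empty list, so the variable is undef.
my ($id_bad) = $bad =~ $re;
print "failed match: ", (defined $id_bad ? $id_bad : "undef"), "\n";   # undef
```

The only difference between the first and second assignment is the parentheses around the variable on the left.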
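You were getting the warning with the earlier example because the regular expression fails there. In list context a failed match returns an empty list, which leaves $sample_ls_id undefined by the time the print statement tries to use it.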
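If you just need the script to get past such lines, one option is to check the result before printing. The following is only a sketch: extract_sample_id is a hypothetical helper, not part of the original script, and skipping unmatched lines with a warning is an assumption about what you want. Whether to skip them or change the input data depends on what the database expects.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the digits that follow the two-character
# prefix, or undef when the identifier does not fit the pattern.
sub extract_sample_id {
    my ($ls_id) = @_;
    my ($digits) = $ls_id =~ /:\w\w(\d+):/;
    return $digits;
}

for my $ls_id ('urn:lsid:dcc.hapmap.org:Sample:NA12003:1',
               'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1') {
    my $id = extract_sample_id($ls_id);
    if (defined $id) {
        print "$ls_id -> $id\n";
    }
    else {
        # Report the offending identifier instead of printing an undefined value.
        warn "Skipping, no match: $ls_id\n";
    }
}
```

In the original loop, the equivalent would be a defined check on $sample_ls_id right after the match, followed by next when it fails.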