I am not a programmer, but I am taking a bioinformatics class as I’m a molecular genetics major… Our assignment is to take a file of multiple entries like this:
77: XP_001929585
PREDICTED: similar to BRCA1 associated protein [Sus scrofa]
gi|194042959|ref|XP_001929585.1| [194042959]
and extract the items I’ve bolded (the description, the taxon name in square brackets, and the gi number in square brackets), then save the results into a pipe-delimited file like this:
194042959|Sus scrofa|PREDICTED: similar to BRCA1 associated protein.
We are using the Sublime editor to write our scripts in Ruby. I know how to open the file and then… well, here’s my script so far…
#!/usr/local/bin/ruby
File.open("mmg231_hw5_brca1.txt").each do |file_line|
  if file_line =~ /^(.+)\[([A-Z].+)\]/
    description = $1
    taxon_name = $2
    puts "#{taxon_name}|#{description}"
  elsif file_line =~ /\[([0-9].+)\]/
    gi_number = $1
    puts "#{gi_number}"
  end
end
I know that it’s wrong. The regular expressions do capture what they need to: the first puts prints the taxon name and the description properly, but I can’t figure out how to get the gi number in there too, as it’s on a different line. I can pull out the gi number on its own as well, but I have no way of linking it to the other two parts. When I pull them out using the regular expressions I developed, they stay in the same order as in the file, so I was trying to think of a way to tell the computer to number each taxon name/description pair 1, 2, 3, etc. as in the file, do the same with the gi numbers, and then say that taxon name/description 1 goes with gi number 1, and so on. Or the computer could get the taxon name and description pair, then just look in the next line for the gi number, but I don’t know how to do this…
Help? An answer in plain English would be helpful: I feel like I should be able to use most help sites, but I just don’t understand the language…
First 4 entries:
1: ZP_00239925
BRCA1 [Bacillus cereus G9241]
gi|47569239|ref|ZP_00239925.1||gnl|WGS:NZ_AAEK|BCE_G9241_3679 [47569239]
2: NP_009225
breast cancer 1, early onset isoform 1 [Homo sapiens]
gi|6552299|ref|NP_009225.1| [6552299]
3: NP_033894
breast cancer 1 [Mus musculus]
gi|161016835|ref|NP_033894.3| [161016835]
4: NP_036646
breast cancer 1 [Rattus norvegicus]
gi|6978573|ref|NP_036646.1| [6978573]
Do the lines always come in pairs?
If so, why not do:
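A minimal sketch of that idea, assuming each description line is always followed by its gi line, as in the four sample entries. The method name parse_entries and the tightened regular expressions are made up for illustration, not taken from the assignment. It remembers the description and taxon name from one line, then prints them together with the gi number when the next gi line comes along:

```ruby
# Hypothetical helper: turns raw lines into "gi|taxon|description" strings.
def parse_entries(lines)
  results = []
  description = nil
  taxon_name = nil
  lines.each do |line|
    line = line.strip
    if line =~ /^gi\|.*\[(\d+)\]$/
      # gi line: join its number with the description/taxon remembered
      # from the line before it, then forget them for the next entry.
      results << "#{$1}|#{taxon_name}|#{description}" if taxon_name
      description = taxon_name = nil
    elsif line =~ /^(.+?)\s*\[([A-Z][^\]]+)\]$/
      # description line: text before the brackets, taxon name inside them.
      description = $1
      taxon_name = $2
    end
    # header lines such as "1: ZP_00239925" match neither pattern and are skipped
  end
  results
end

# Two of the sample entries from the question, used as test input.
sample = <<~TEXT.lines
  1: ZP_00239925
  BRCA1 [Bacillus cereus G9241]
  gi|47569239|ref|ZP_00239925.1||gnl|WGS:NZ_AAEK|BCE_G9241_3679 [47569239]
  2: NP_009225
  breast cancer 1, early onset isoform 1 [Homo sapiens]
  gi|6552299|ref|NP_009225.1| [6552299]
TEXT

puts parse_entries(sample)
# prints:
# 47569239|Bacillus cereus G9241|BRCA1
# 6552299|Homo sapiens|breast cancer 1, early onset isoform 1
```

To work on the real file, pass File.readlines("mmg231_hw5_brca1.txt") instead of sample, and write each returned string to an output file opened with File.open(name, "w") rather than printing it.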