I ran an hmmscan analysis on a FASTA file and requested tabular output with the --tblout option. This format is deliberately space-delimited (rather than tab-delimited) and justified into aligned columns.
The file looks like this (this is just a format example):
targetname accession queryname accession e-value score bias
x_x_x PFyyyy.y ContigXXX_0 - x.xe-xx yy.y x.x
x PFyyyy.yy ContigXXX_1 - xe-x yy.y x.x
x_x PFyyyy.y ContigXXX_2 - xe-xx y.y x.x
x_x_x PFyyyy.yy ContigXXX_3 - x.xe-x yy.y x.x
.
..
where the target names are, for example: Methyltransf or Dimer_tnp_hAT or Nucleotide_trans;
where the accessions are, for example: PF13847.1 or PF03407.11 or PF01958.13;
where the query names are, for example: Contig244_1 or Contig44245_3 or Contig12345_6;
where the second accession column is: -
where the e-values are, for example: 4.0e-10 or 3.5e-15, etc.;
and where score and bias are numbers in this format: xx.x
What I’d like to do is extract the queryname column, which lists every ContigXXX_X that has a significant hit to a protein domain.
After that I’ll be able to sort the names and keep only the first occurrence of each Contig, and then compare the file with my BlastP and BlastX results (where I was already able to get the list of my Contigs that have hits to the nr database).
So my question is: How can I cut the column where all my Contigs are?
I’ve tried the grep, sed, and cut commands, but I haven’t found the right one yet.
I’m new to Unix and still learning, so any suggestions would be really appreciated.
And if my question is not clear just tell me, I can modify it!
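To make the goal concrete, here is a minimal sketch of the kind of pipeline described above. It is only one possible approach. It assumes that the query name is the third whitespace-separated field, that header/comment lines start with `#`, and that no field contains spaces. The file names `sample.tbl` and `contigs_with_hits.txt` and the score/bias values are hypothetical. `awk` is used rather than `cut -d' '` because `cut` treats every single space as a delimiter, which breaks on columns padded with several spaces.

```shell
# Hypothetical sample in the same layout as the --tblout file
# (names taken from the examples above; score and bias values are made up).
cat > sample.tbl <<'EOF'
# target name     accession   query name     accession  E-value  score  bias
Methyltransf      PF13847.1   Contig244_1    -          4.0e-10  39.5   0.1
Dimer_tnp_hAT     PF03407.11  Contig44245_3  -          3.5e-15  55.2   0.0
Nucleotide_trans  PF01958.13  Contig244_1    -          4.0e-10  25.0   0.3
EOF

# Skip lines starting with '#', print the 3rd field (awk splits on runs of
# whitespace by default), then sort and keep one copy of each name.
awk '!/^#/ { print $3 }' sample.tbl | sort -u > contigs_with_hits.txt

cat contigs_with_hits.txt
```

The resulting list has one query name per line, so it could then be compared against the BlastP/BlastX lists with a tool such as `comm` or `grep -F -f`.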