Crash course in PERL

Reading arguments in PERL

In general, our scripts will read a number of arguments, like the name of the input files, parameter values, a flag, etc... These arguments can be incorporated into our PERL program using the variable @ARGV. It is then necessary to check that these values are defined before proceeding with the program. This check, in fact, is also a good chance to give some information to the user on how to use the program. Here we give an example for two arguments:

#!/usr/bin/perl -w

use strict;

my $arg0 = $ARGV[0];
my $arg1 = $ARGV[1];

if ( !$arg0 && !$arg1 ){
   print STDERR "program to calculate blah-blah\n";
   print STDERR "Usage: $0 argument1 argument2\n";
   print STDERR "argument1 is such-and-such\n";
   print STDERR "argument2 is such-and-such\n";
   exit(0);
}
...

The messages are printed via standard error if any of the arguments is not defined, and the program is exited with exit(0). This messages explains how to use the program, whose name is held by the variable $0, and the meaning of the arguments.

This simple trick is very effective, since simply by running the program without arguments will produce this information; and it is not a priori necessary to know any of the parameters and options of the program.


Reading files in PERL

For many problems, the simplest and most convenient way to represent our data is in a plain text file, with one item per line, and tab separated.

chr14   63631555        63631576        hsa-miR-548    .       -
chr14   100405206       100405227       hsa-miR-493    .       +
chr14   43426666        43426687        hsa-miR-548    .       +
chr14   65007623        65007643        hsa-miR-625    .       -
chr14   100591390       100591412       hsa-miR-668    .       +
chr14   100591517       100591538       hsa-miR-485    .       +

This is very easy to read with perl. We can read the file line by line:

#!/usr/bin/perl -w
use strict;

my $file = $ARGV[0];
if ( !$file ){
   print STDERR "Usage: $0 file_name\n";
   exit(0);
}

open(FILE,"<$file") or die("cannot open $file");
while (<FILE>){
    chomp;
    my $line = $_;
    ...
}

Reading tab-separated files in PERL - vectors

If each line is tab separated, we can store the column values per line in a vector:

#!/usr/bin/perl -w
use strict;

my $file = $ARGV[0];
if ( !$file ){
   print STDERR "Usage: $0 file_name\n";
   exit(0);
}

open(FILE,"<$file") or die("cannot open $file");
while (<FILE>){
    chomp;
    my @line = split("\t", $_);    #### <--- we split on tabs
    ...
}

The command split "automagically" splits the input of one line (that is held in $_\) into scalar values. This line is, in fact, completely equivalent to this:

    my ($item_1, $item_2, ... , $item_N) = split("\t", $_);

and to this:

    my ($item_1, $item_2, ... , $item_N) = split;

A vector is a list of values, where each value is associated to an index in the vector. Indexes start counting from 0:

my @line   = ($item_1, $item_2, ..., $item_N);
my $item_i = $line[$i-1];

The length of a vector @vector is given by scalar(@vector).


Splitting symbols in a string

The function split can also be used to split one column entry into individual characters. Consider for instance this input file

prot1        VVTTQVTIPKDLAGSIIGKGGQRIKQIRHDSGAAIKIDEPLEGSEDRIITITGTQDQIQNAQF
prot2        TVLMVYGLNPKMTCDRVFNILCLYGNVMKVKFLKSKPGTAMVQMGDPSAVGRAISNLSGMKFFDEKLVLAPSKQ
prot3        TVVRLRGLPFGCSKEEIVQFFVPNGITLTLDYQGRSTGEAFVQFASKEIAENALGKHKERIGHRYIEIFKSSK
prot4        VFVRLRGLPFQCSKEEVAQFFSGLEIVPNGITLPLDNGRSTGEAYVEFGSPESAEKALTKHKEKIGHRYIEIFKSSK

And we are interested in obtaining the symbols in each sequence. We can do the following

#!/usr/bin/perl -w
use strict;
...

while (<FILE>){
    chomp;
    my ($id, $seq) = split("\t", $_);
    
    my @aminoacids = split(//,$seq);   ##### <-- we split on the empty string                          
    ...
}

This will fill the array @aminoacids with as many elements as characters in the string $seq.

We can access the elements in the array by specifying the index in the list of elements, individually:

my $first_AA  = $aminoacids[0];
my $second_AA = $aminoacids[1];
...

my $last_AA   = $aminoacids[-1];

or using a loop


for(my $i=0; $i < scalar(@aminoacids); $i++){
   my $current_AA = $aminoacids[$i];                                 
}


Counting symbols - hashes

Imagine that we have already obtained a vector with all the symbols in the string. For instance, all the aminoacids. How can we count how many times appear each symbol?

We can use hashes. A hash (a.k.a. look-up table or dictionary) is a data structure similar to a vector but with two main differences:

Tha hash is defined a two lists: one list of "keys" and a list for "values", such that for every unique key we have an associated (not-necessarily unique) value. A hash can be defined in the following way:

#!/usr/bin/perl -w

use strict;

my %polarity = ("Alanine"       => "nonpolar",
                "Arginine"      => "polar",
                "Asparagine"    => "polar",
                ...
                "Phenylalanine" => "nonpolar");

One value from this hash can be accessed using the corresponding key:

my $result = $polarity{Alanine};

The list of all keys or all values from the hash can also be accessed:

my @aminoacids  = keys( %polarity );
my @polarities  = values( %polarity );

We have now the elements to be able to write a program that reads information from a list of sequences and count the number of times we see each symbol:

#!/usr/bin/perl -w
use strict;

my $file = $ARGV[0];
if ( !$file ){
   print STDERR "Usage: $0 file_name\n";
   exit(0);
}

my %aminoacid_count;
open(FILE,"<$file") or die("cannot open $file");
while (<FILE>){
    chomp;
    my ($id, $seq) = split("\t", $_);
    my @aminoacids = split(//,$seq);                           
    for(my $i=0; $i < scalar(@aminoacids); $i++){
        $aminoacid_count{$aminoacids[$i]}++;      
    }
}



Technical tip: Reading FASTA files: The Tbl format

For this and other problems we will have to read a multi-fasta file: a file with multiple sequences in Fasta format. The easiest way to do this is to conver the fasta file, e.g.:

>ENSE00001297638|ENSG00000070371
GATCGTGGCTACTTTGAGGAGCTGATCTTGCTGTTGGAAGCGGCCCTGGGCCTGGAGCGG
GCCCACATGGGCATGTTCACTGAGCTGGCCATCCTCTACTCCAAATTCAAGCCACAGAAG
ATGCTGGAGCATCTGGAGCTTTTCTGGTCCCGTGTCAACATCCCAAAG
>ENSE00001318650|ENSG00000070371|
GTGTGCTTTGCCTGCATGGATGGACAAGAGTTCCGCTTCGCACAGCTGTGTGGTCTTCAC
ATCGTCATTCATGCAGATGAGCTGGAGGAGCTGATGTGCTATTACCAG
...

In table format, which has one line per sequence, and separates the ids and the sequence in two tab-separated columns:

ENSE00001297638|ENSG00000070371    GATCGTGGCTACTTTGAGGAGCTGATCTTGCTGTTGGAAGCGGCCCTGGGCCTGGAGCGGGCCCACATGGGCATGTTCACTGAGCTGGCCATCCTCTACTCCAAATTCAAGCCACAGAAGATGCTGGAGCATCTGGAGCTTTTCTGGTCCCGTGTCAACATCCCAAAG
ENSE00001318650|ENSG00000070371    GTGTGCTTTGCCTGCATGGATGGACAAGAGTTCCGCTTCGCACAGCTGTGTGGTCTTCACATCGTCATTCATGCAGATGAGCTGGAGGAGCTGATGTGCTATTACCAG
...

This "Table" or Tbl format is very easy to read from a script with the typical loop structure:

#!/usr/bin/perl -w



use strict;

while(<STDIN>){
   chomp;
   my ( $id, $seq ) = split (/\t/, $_);
}
...

The FastaToTbl.pl program: This will be the PERL code to convert a multi-fasta file into Tbl format:

#!/usr/bin/perl -w

use strict;

my $count = 0;
while(){
    chomp;
    if ( $_=~/\>(\S+)/ ){
        print "\n" if $count > 0;
        print $1."\t";
        $count++;
    }
    else{
        print $_;
    }
}
print "\n";


Try this out: write a program TblToFasta.pl that performs the reverse conversion, i.e. transforms a Tbl file into a Fasta file.