PRIAM_search Program (Beta) tutorial

Requirements

To work, PRIAM_search needs the NCBI BLAST stand-alone applications to be installed (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and a complete release of PRIAM (only PRIAM releases built after August 2010 work with this program).

Program arguments:

-bd : To work, PRIAM needs the NCBI BLAST suite to be installed. However, depending on how you installed the BLAST package, it may be necessary to tell PRIAM where the BLAST binaries are located on your computer. To find out whether you need this option, simply type "blastpgp -" in your console. If your computer does not complain and shows the blastpgp help menu, the "bd" option is unnecessary for you. Otherwise, find the directory where the "blastpgp" and "formatrpsdb" BLAST binaries are located (they must be in the same directory) and give it to PRIAM with the "bd" option (for example "-bd /usr/local/blast/bin"). This option normally needs to be used only the first time you use PRIAM (or if the directory of the BLAST binaries has changed).

-np : Number of processors to use. Only useful for multicore processors (not for clusters of computers).

-n : Name of your job. It is used to name the output files (and, if your run crashed, to recover it).

-i : The path to the file containing the protein sequences you want to analyse (FASTA format). Each sequence name must be unique, because sequences are identified only by their names in the results files. Take care: PRIAM is not yet able to work with nucleotide sequences. If you have nucleotide sequences, you need to translate them into protein sequences before giving them to PRIAM.

-p : Path of the directory containing the release of PRIAM you want to use.

-od : Output directory where all intermediate and results files are written.

-e : Erase job intermediate files? (T/F). This is done only if the job completes successfully.

-pt : The probability threshold above which an activity, represented by a set of matching profiles, is considered present. PRIAM is now able to associate each profile hit with a Bayesian probability that the hit is a true positive. It is the joint probability of the set of profiles characterizing an activity that is used as the criterion to decide whether this activity is kept in the final predictions or not.

-mo : Maximum overlap length between the matches of two profiles. By default in PRIAM, all the matches passing the probability threshold are kept to annotate a sequence. However, to increase specificity, it is possible to consider only the best non-overlapping profiles (in that case, only the main activity might be predicted). The "-mo" option defines the length from which two matches are considered as overlapping (setting this parameter to "-1" deactivates this filter). We recommend using this filter with a low probability threshold.

-mp : Minimal proportion of a profile's length that must be matched for the profile to be considered. Usually, the catalytic domains of enzymes have a constrained structure with a fairly conserved genomic length. Truncated domains therefore have a very low probability of being functional (or at least of having kept their enzymatic specificity). So, for functional annotation, incomplete profile matches can be removed by setting this parameter to a convenient value (typically 60-80). However, if you are interested in complete genome annotation and you assume that many of your genes are incomplete (for example in the case of small contigs), it may be necessary to lower this value to get interpretable results.

-cg : Analyse dataset as a complete genome? (T/F). If you want to analyse a complete genome, this option must be set to true.
This results in the use of the PRIAM genome annotation rules, which define the minimal set of modules that must be found in a genome to ensure a given enzymatic activity (see the PRIAM publication of 2003 for more details).

-cc : Check for catalytic residue patterns? (T/F). Each PRIAM profile may be associated with a pattern of catalytic residues, which is automatically designed using Swiss-Prot annotations (when available). Some positions of a pattern can thus be tagged as corresponding to known catalytic residues. If you activate this check, PRIAM verifies, for each match, that the known catalytic residues are found. If they are not, the match is considered a false positive. Using this option increases specificity, as it allows inactive enzymes to be rejected as false positives, but it can also significantly reduce sensitivity for some enzymes (those whose biological diversity was poorly represented in the PRIAM training dataset). So use it with caution.

Output files:

In the output directory, PRIAM creates two subdirectories:
- A directory called "DATA" that contains a copy of your query file split into subfiles.
- A directory called "RESULTS" that contains all the results concerning your job.
and a file containing the parameters used for the job.

In the RESULTS directory you will find many files. All the files concerning the same job have the same prefix: "paj_"+name of your job. Among these files you should find:
- a file ending with the "_predictableECs.txt" suffix. This file lists all the EC numbers that the PRIAM release you use for this job is able to predict. So, if you do not find an EC you were expecting in the results, please make sure this EC is in that list before calling PRIAM into question ;)
- a file ending with the "_seqsHits.tab" suffix. This file lists all the profile hits for each sequence in the query file.
- a file ending with the "_seqsECs.txt" suffix.
This file lists the ECs predicted for each sequence in the query file. This is the file to look at if you use PRIAM for sequence annotation. For each sequence, it reports all activities for which at least one representative profile has matched. Each activity is followed by the joint e-value of the representative profiles corresponding to it, then the probability of the prediction, and finally whether the prediction should be kept or not (T=kept; F=not kept). An optional "(fragment)" tag may follow, warning that a truncated representative domain has matched for the activity considered. So, if the probability for this activity is low, it may simply be because the protein sequence is incomplete. Be aware, however, that this information is based only on representative modules, so not all incomplete polypeptides can be identified this way.

In the case of a genome annotation job you should also find in the RESULTS directory:
- a file ending with the "_genomeECs.txt" suffix that lists all the ECs predicted as being present in your genome (according to the PRIAM genome annotation rules) with their associated probability.
- a directory "PATHWAYMAPS" that contains all the KEGG reference pathway maps colored according to the PRIAM predictions for your genome.
- a directory "METANET" that contains the stoichiometric model corresponding to the metabolic network of your genome, in the form of a stoichiometry matrix and a ScrumPy model file.
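
The naming scheme described above (a "RESULTS" subdirectory, a "paj_"+job name prefix and a fixed set of suffixes) can be used to locate the result files of a job from a script. The following Python sketch is only an illustration: the function name and the dictionary labels are hypothetical, and it assumes that nothing more than the prefix and the suffix is needed to recognise a file. The content of the files is not parsed, as their exact layout is not described here.

```python
from pathlib import Path

# Suffixes named in this tutorial. "_genomeECs.txt" is only expected
# for genome annotation jobs (-cg T).
SUFFIXES = {
    "predictable_ecs": "_predictableECs.txt",
    "sequence_hits": "_seqsHits.tab",
    "sequence_ecs": "_seqsECs.txt",
    "genome_ecs": "_genomeECs.txt",
}


def find_result_files(output_dir, job_name):
    """Return {label: [paths]} for the result files of one job.

    Hypothetical helper: a file is selected when its name starts with
    "paj_" + job_name and ends with one of the suffixes above.
    """
    results_dir = Path(output_dir) / "RESULTS"
    prefix = "paj_" + job_name
    found = {label: [] for label in SUFFIXES}
    if not results_dir.is_dir():
        return found
    for path in sorted(results_dir.iterdir()):
        if not path.name.startswith(prefix):
            continue
        for label, suffix in SUFFIXES.items():
            if path.name.endswith(suffix):
                found[label].append(path)
    return found
```

Note that a job name that is the beginning of another job name (for example "run" and "run2") would match the files of both jobs, so distinct job names are safer when several jobs share an output directory.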

Recommended options:

If you are interested in annotating independent protein sequences:
We suggest you start with the following parameters:
-pt 0.5 -mo 20 -mp 70 -cc T -cg F
These parameters are quite stringent, so they should minimise the number of wrong annotations.
Then you can relax some parameters (-pt 0.5 -mo -1 -mp 60 -cc F -cg F) if you want better sensitivity (but this also results in a decrease in specificity).

If you are interested in annotating a complete genome:
First try with:
-pt 0.5 -mo -1 -mp 70 -cc T -cg T
If your genome belongs to an organism from a domain of life that is poorly represented in the Swiss-Prot database, it may be necessary to relax the score thresholds. In that case try with:
-pt 0.2 -mo -1 -mp 60 -cc F -cg T

If you have any question or if you encounter an issue with this program, feel free to send me an e-mail at: bernard@biomserv.univ-lyon1.fr
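
For scripted runs, the four recommended parameter sets can be kept in one place and combined with the job-specific arguments. The Python sketch below only builds the argument list. It does not launch anything, because the command used to start PRIAM_search is not given in this tutorial. The preset names and the function name are hypothetical; the option values are the ones recommended above.

```python
# Recommended option sets from this tutorial, under hypothetical preset names.
PRESETS = {
    "sequences_stringent": {"-pt": "0.5", "-mo": "20", "-mp": "70", "-cc": "T", "-cg": "F"},
    "sequences_relaxed": {"-pt": "0.5", "-mo": "-1", "-mp": "60", "-cc": "F", "-cg": "F"},
    "genome_default": {"-pt": "0.5", "-mo": "-1", "-mp": "70", "-cc": "T", "-cg": "T"},
    "genome_relaxed": {"-pt": "0.2", "-mo": "-1", "-mp": "60", "-cc": "F", "-cg": "T"},
}


def build_arguments(preset, job_name, fasta_path, release_dir, output_dir):
    """Return the PRIAM_search arguments for one job as a list of strings.

    The list starts with the job-specific options (-n, -i, -p, -od) and
    ends with the options of the chosen preset. An unknown preset name
    raises KeyError.
    """
    args = ["-n", job_name, "-i", fasta_path, "-p", release_dir, "-od", output_dir]
    for flag, value in PRESETS[preset].items():
        args += [flag, value]
    return args


if __name__ == "__main__":
    # Example with made-up paths; prepend your own PRIAM_search launcher.
    print(" ".join(build_arguments(
        "genome_default", "my_genome", "proteins.fasta", "/data/PRIAM", "/data/out")))
```

Options that are not part of a recommended set, such as -bd, -np or -e, are deliberately left out and can be appended to the returned list when needed.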