mkdom2 and xdom: the documentation


What is mkdom2/xdom?

mkdom2 is the program we routinely use to build each new release of ProDom. The algorithm is described elsewhere (Gouzy et al., 1999). Briefly, it relies on the assumption that the shortest amino acid sequence corresponds to a single domain and can therefore be used as a query to screen the database with the PSI-BLAST program, in order to cluster homologous domains. To build ProDom, we run this program on the whole Swiss-Prot/TrEMBL database, but it can be run on any set of protein sequences, provided you have them in a FASTA file.
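
The idea can be illustrated with a toy sketch in Python. This is not the mkdom2 implementation: exact substring matching stands in for the PSI-BLAST search, and the function name toy_mkdom and the sequences are hypothetical. The sketch repeatedly takes the shortest remaining fragment as a query, records every occurrence of it as one family, and puts the flanking pieces back in the pool.

```python
def toy_mkdom(sequences):
    """Cluster fragments shortest-first (toy stand-in for the real search)."""
    # Pool of (name, first position, residues) fragments not yet assigned.
    pool = [(name, 1, seq) for name, seq in sequences.items() if seq]
    families = []
    while pool:
        # Assumption of the method: the shortest fragment is a single domain.
        pool.sort(key=lambda frag: (len(frag[2]), frag[0], frag[1]))
        q_name, q_start, query = pool.pop(0)
        members = [(q_name, q_start, q_start + len(query) - 1)]
        remaining = []
        for name, start, seq in pool:
            cursor = 0
            pos = seq.find(query)  # exact match replaces PSI-BLAST here
            while pos >= 0:
                members.append((name, start + pos, start + pos + len(query) - 1))
                if pos > cursor:
                    # Piece between two hits goes back to the pool.
                    remaining.append((name, start + cursor, seq[cursor:pos]))
                cursor = pos + len(query)
                pos = seq.find(query, cursor)
            if cursor < len(seq):
                # Trailing piece (or the whole fragment if there was no hit).
                remaining.append((name, start + cursor, seq[cursor:]))
        pool = remaining
        families.append(members)
    return families
```

Each family is a list of (name, start, end) tuples with 1-based positions. For example, toy_mkdom({"A": "MKV", "B": "GGMKVTT"}) groups A:1-3 with B:3-5, then treats the leftover pieces B:1-2 and B:6-7 as separate single-member sets.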

xdom is a graphical program that helps you analyze the domains detected by mkdom2 by visualizing all domain arrangements in the protein set.


Installation

It is good practice to install the package in the directory /usr/local, so that all users can run mkdom2/xdom2. However, this is not required: you may prefer to install the package in your home directory, which does not require root privileges.

Prerequisites

You need a reasonably recent computer with enough memory. The programs were tested on a Pentium 3 computer with 512 MB of memory (Linux) and on a 480 MHz Sun UltraSPARC machine (SunOS). They currently run only on Linux/Intel x86 or SunOS/SPARC systems. However, you may install the package in a directory shared by several machines of both architectures.
You must have Perl installed, version 5.6.1 or later. No modules other than those found in the standard Perl distribution are required, except for the Df module written by Ian Guthrie. For convenience, this module is included in the mkdom2/xdom2 distribution, so you should not have to install it yourself.
You should use the csh or tcsh shell.

Unpacking the distribution and editing .login

Unpack the distribution with:

gunzip < xdom2.0-tar.gz | tar xvf -

This should create a directory called Xdom2.0, whose content looks like this:

ls -l Xdom2.0
total 32
drwxr-xr-x 4 manu prodom 4096 Dec 16 12:32 bin
drwxrwxr-x 3 manu prodom 4096 Dec 15 14:01 doc
drwxrwxr-x 5 manu prodom 4096 Dec 15 12:14 lib
-rw-rw-r-- 1 manu prodom 3658 Dec 17 15:53 mkdom2setup.pl
-rw-rw-r-- 1 manu prodom 1479 Dec 18 08:54 README
-rw-rw-r-- 1 manu prodom 229 Dec 17 16:36 setup.csh
drwxr-xr-x 2 manu prodom 4096 Dec 18 14:30 Test

Please run the script mkdom2_install.pl using the command:

perl Xdom2.0/mkdom2_install.pl

This will create a csh file called setup_Linux.csh or setup_SunOS.csh. You must source this file before running mkdom2 or xdom2. Alternatively, you may source the file called setup.csh, which automatically calls the appropriate setup file for the machine's architecture.
It may be useful to add the following line at the end of your .login file:

cd Xdom2.0; source setup.csh; cd

Testing the program

Before starting, it is important to test the program to make sure that everything works properly. This can be done with the command:

mkdom2test.pl

This script produces a local results file, Test_lcl.51, and compares it with the reference file Xdom2.0/Test/Test.51. Should the two files differ, the program reports that the test did not succeed. In that case, compare Xdom2.0/Test/Test.51 with the locally generated Test_lcl.51 to investigate the problem.


Starting a new project

A complete domain analysis with mkdom2/xdom2 involves the operations described in the following sections.

Setting up an environment

Before starting the mkdom2 program, you have to create a working environment (i.e. some directories and files that will be used by the programs). This is done by running the script mkdom2cfg.pl, which asks you a few questions.

Let's say your project is called organism and the version number is 20031225: a directory called organism-20031225 is then created, containing a few files and directories, as shown below:

$ ls -l organism-20031225
total 12
drwxrwxr-x 2 manu prodom 4096 Dec 11 15:37 checkpoint
drwxrwxr-x 2 manu prodom 4096 Dec 11 15:37 data
-rw-rw-r-- 1 manu prodom 48 Dec 11 15:37 mkdom2.conf

Starting the mkdom2 script

Change to the data directory; mkdom2 may then be started with the command:

mkdom2 IN=organism.fasta LOG=mkdom2.log & 

The LOG switch is optional. If it is not specified, mkdom2 logs to the standard output, which may be inconvenient when the run lasts a long time (typically several hours or days for big FASTA files).

The time stamps

From time to time, and in particular at each blastpgp execution, a time stamp is generated in the following form:

#03#12#09#07#27#54#11#

This stamp encodes a date, here December 9th, 2003 at 7:27:54. Stamps may be generated at a relatively high rate, and the last number (11) ensures that two stamps differ even if less than a second has elapsed between them. The stamps are used to check the synchronization between the many files created.
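
If you need to handle stamps in your own scripts, the layout described above can be decoded as in the following Python sketch. The function name parse_stamp is ours, not part of the distribution, and the sketch assumes that two-digit years belong to the 2000s.

```python
import re
from datetime import datetime

# Seven two-digit fields: year, month, day, hour, minute, second, counter.
STAMP_RE = re.compile(r"#(\d{2})#(\d{2})#(\d{2})#(\d{2})#(\d{2})#(\d{2})#(\d{2})#")

def parse_stamp(stamp):
    """Return (datetime, counter) for a stamp such as '#03#12#09#07#27#54#11#'."""
    match = STAMP_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a time stamp: {stamp!r}")
    yy, month, day, hour, minute, second, counter = map(int, match.groups())
    # Assumption: two-digit years are counted from 2000.
    return datetime(2000 + yy, month, day, hour, minute, second), counter
```

For the stamp shown above, the function returns December 9th, 2003 at 7:27:54 together with the counter 11.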

Checkpointing the data during the execution

The data are checkpointed from time to time, so that as little data as possible is lost in case of an unexpected interruption. The important temporary files are automatically copied to the directory checkpoint/<stamp>, where <stamp> is the stamp generated at the moment of the checkpoint. However, only 2 subdirectories are kept under the checkpoint directory, in order to avoid filling the disk.

Interrupting mkdom2 in an orderly manner

mkdom2 may run for a very long time, depending on the data. It is therefore useful to be able to interrupt the program without losing the work already done. This is done by creating an empty file called MKD.stop:

touch MKD.stop

The program checks for the existence of this file only from time to time, so it may keep running for a few minutes before stopping. The data are then checkpointed and saved in the directory called current (which is in fact a symbolic link to a directory named 1, 2, ...).
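
The stop-file mechanism is a common pattern for long-running jobs. The Python sketch below only illustrates the principle; it is not the code used by mkdom2, and the names run_until_stopped and process are hypothetical. The file is checked between two units of work, which is why the program does not stop immediately.

```python
import os

def run_until_stopped(work_items, process, stop_file="MKD.stop"):
    """Process items one by one; stop cleanly once stop_file exists."""
    done = []
    for item in work_items:
        # The stop file is only looked for between two units of work,
        # so the unit in progress is always completed.
        if os.path.exists(stop_file):
            break
        done.append(process(item))
    # A real program would checkpoint its data here before exiting.
    return done
```

Because the unit of work in progress is always finished before the check, no partial result is left behind when the job stops.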

Retrieving data after an unpredicted interruption

Should the program be interrupted unexpectedly (after a power failure, a system crash, etc.), you must retrieve the last checkpointed files before resuming the operation. Identify the most recent subdirectory of the checkpoint directory (using its time stamp), change to that directory, and type the following command:

cp * ../../data/current

This copies every file found in this directory to the current results directory for later reference. Note, however, that the computations performed between this checkpoint and the time of the interruption are lost.
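
Finding the most recent checkpoint can be scripted. The Python sketch below assumes that the subdirectories of checkpoint are named exactly after their stamp; the function name latest_checkpoint is ours. Since every field of a stamp has a fixed width and the fields run from year to counter, comparing the names as plain strings gives the chronological order.

```python
import re

# A stamp is seven two-digit fields, each preceded by '#', plus a final '#'.
STAMP_DIR_RE = re.compile(r"(#\d{2}){7}#")

def latest_checkpoint(names):
    """Return the most recent stamp among directory names, or None."""
    stamps = [name for name in names if STAMP_DIR_RE.fullmatch(name)]
    # Fixed-width fields ordered year..counter: string order is time order
    # (assuming all stamps fall in the same century).
    return max(stamps, default=None)
```

A typical call would be latest_checkpoint(os.listdir("checkpoint")), after importing the os module.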

Resuming the execution

Resuming the process after an orderly interruption, or after an unexpected interruption followed by a successful retrieval of the data, is an easy task:

cp current/organism.fasta.SL .
mkdom2 IN=organism.fasta.SL LOG=mkdom2.log

Please note that the input file is now organism.fasta.SL, that is, the original FASTA file sorted by sequence length and purged of the domains already found.
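
To give an idea of what sorting by sequence length means, here is a minimal Python sketch that orders FASTA records from shortest to longest. It only illustrates the sorting step: it does not remove the domains already found, it is not the code used by mkdom2, and the function name sort_fasta_by_length is hypothetical.

```python
def sort_fasta_by_length(text):
    """Return the FASTA records in text as (header, sequence), shortest first."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    # sorted() is stable: sequences of equal length keep their input order.
    return sorted(records, key=lambda record: len(record[1]))
```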


Quality checks

Looking at the log and result files

The process may be monitored while mkdom2 is running, mainly by looking at the clustering log file and at the temporary results file, called Mkdom2.tmp.LogMKD and Mkdom2.tmp.prodom.51 respectively. The following lines come from the clustering log file. They show that a few families are generated, then from time to time the database is reorganized (some domains are taken out of the database, the database is sorted again, and the formatdb utility is run). When this occurs, the program checks the remaining disk space, because a full disk could lead to incorrect results and data loss. Should the free disk space drop too low, the program is interrupted in an orderly manner: you should then remove some files to recover disk space, and resume the program.

#03#12#11#15#45#49#00# _M_ PSIBLAST OK - FAM 477 
#03#12#11#15#45#49#00# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#49#00# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37137
#03#12#11#15#45#53#00# _M_ PSIBLAST OK - UNIQ 478
#03#12#11#15#45#53#01# _M_ PSIBLAST OK - UNIQ 479
#03#12#11#15#45#54#00# _M_ PSIBLAST OK - UNIQ 480
#03#12#11#15#45#54#01# _M_ PSIBLAST OK - UNIQ 481
#03#12#11#15#45#54#02# _M_ PSIBLAST OK - FAM 482
#03#12#11#15#45#54#02# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#54#02# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37140
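
For a long run, it may be handy to summarize such a log rather than read it line by line. The Python sketch below counts message types in lines formatted like the excerpt above. The function name summarize_log is ours, and the reading of FAM as a family with several members and UNIQ as a single-member set is an assumption based on this excerpt only.

```python
import re
from collections import Counter

# A log line is a time stamp, the _M_ marker, then the message.
LOG_RE = re.compile(r"^(#(?:\d{2}#){7}) _M_ (.*)$")

def summarize_log(lines):
    """Count families, singletons and database reorganizations in log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # not a clustering message
        message = match.group(2)
        if message.startswith("PSIBLAST OK - FAM"):
            counts["families"] += 1  # assumed: set with several members
        elif message.startswith("PSIBLAST OK - UNIQ"):
            counts["singletons"] += 1  # assumed: set with one member
        elif message.startswith("NOW REORGANIZING DATABASE"):
            counts["reorganizations"] += 1
    return counts
```

A typical call would be summarize_log(open("Mkdom2.tmp.LogMKD")).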

The following shows the corresponding lines, extracted from the intermediate results file:

Cluster #477: ---------------------------------------------
// STAMP #03#12#11#15#45#49#00#
// QUERY GSTEN:00003660:P:001#1#32

Set # 477:
[GSTEN:00003660:P:001 1 32
[GSTEN:00021931:P:001 22 54

Cluster #478: ---------------------------------------------
// STAMP #03#12#11#15#45#53#00#
// QUERY GSTEN:00021931:P:001#1#21

Set # 478:
[GSTEN:00021931:P:001 1 21 S=115

Cluster #479: ---------------------------------------------
// STAMP #03#12#11#15#45#53#01#
// QUERY GSTEN:00005831:P:001#1#32

Set # 479:
[GSTEN:00005831:P:001 1 32 S=163

Cluster #480: ---------------------------------------------
// STAMP #03#12#11#15#45#54#00#
// QUERY GSTEN:00005952:P:001#1#32

Set # 480:
[GSTEN:00005952:P:001 1 32 S=200

Cluster #481: ---------------------------------------------
// STAMP #03#12#11#15#45#54#01#
// QUERY GSTEN:00006207:P:001#1#32

Set # 481:
[GSTEN:00006207:P:001 1 32 S=182

Cluster #482: ---------------------------------------------
// STAMP #03#12#11#15#45#54#02#
// QUERY GSTEN:00006690:P:001#1#32

Set # 482:
[GSTEN:00006690:P:001 1 32
[GSTEN:00010495:P:001 44 75
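
The layout of this file is regular enough to be read by a short script. The Python sketch below parses text formatted like the excerpt above; it is based on this excerpt only, not on a format specification, and the function name parse_clusters is ours. Trailing fields such as S=115 are kept as text, without interpretation.

```python
import re

def parse_clusters(lines):
    """Parse cluster blocks into dicts: number, stamp, query and members."""
    clusters = []
    current = None
    for raw in lines:
        line = raw.strip()
        header = re.match(r"Cluster #(\d+):", line)
        if header:
            current = {"number": int(header.group(1)), "stamp": None,
                       "query": None, "members": []}
            clusters.append(current)
        elif current is None:
            continue  # ignore anything before the first cluster header
        elif line.startswith("// STAMP "):
            current["stamp"] = line.split()[2]
        elif line.startswith("// QUERY "):
            current["query"] = line.split()[2]
        elif line.startswith("["):
            # Member line: [identifier start end, then optional KEY=VALUE fields.
            fields = line[1:].split()
            member = {"id": fields[0], "start": int(fields[1]), "end": int(fields[2])}
            for extra in fields[3:]:
                key, _, value = extra.partition("=")
                member[key] = value  # e.g. S=115, kept as uninterpreted text
            current["members"].append(member)
    return clusters
```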

Executing mkdom2ck.pl

In order to verify the consistency of the log files, the mkdom2ck.pl script tests for synchronization problems that may cause data loss or data corruption. Such problems can occur in particular if the program is interrupted during a run and incorrectly resumed. mkdom2ck.pl performs several consistency checks.

The check may be done with:

mkdom2ck.pl --db organism.fasta [--no_db_check]

Postprocessing and data analysis

Executing mkdom2pp.pl

The data must now be postprocessed, which implies:

  • Concatenation of all the results files (so far they are split between the directories 1, 2, 3, ...).
  • Transformation of those files to a standard ProDom file and to an xdom format file.
  • Computation of the multiple alignments of domains in each family (this rather long step may be skipped).
  • Creation of a project file ready to be read by the xdom visualization program.

The postprocessing may be done with:

mkdom2pp.pl --db organism.fasta [--no_alignment]

Viewing the data with the xdom program

You can now admire, print, and think about your data with the xdom2 program:

xdom2 organism.prj

Citing the mkdom2 program

Should you use this program for a publication, please cite the following reference: Gouzy J., Corpet F. & Kahn D. (1999). Whole genome protein domain analysis using a new method for domain clustering, Computers and Chemistry. 23:333-340.