mkdom2 and xdom: the documentation
- What is mkdom2/xdom ?
- Starting a new project
- Setting up an environment
- Starting the mkdom2 script
- Quality checks
- Postprocessing and data analysis
- Citing the mkdom2 program
What is mkdom2/xdom ?
mkdom2 is the program we use routinely to build each new release of ProDom. The algorithm is described elsewhere (Gouzy et al., 1999); briefly, it relies on the assumption that the shortest amino acid sequence corresponds to a single domain and may be used as a query to screen the database with the PSI-BLAST program, in order to cluster homologous domains. To build ProDom, we run this program on the whole Swiss-Prot/TrEMBL database, but it can be run on any set of protein sequences (as long as you have a fasta file).
xdom is a graphical program that helps you analyze the domains detected by mkdom2, as it visualises all domain arrangements in the protein set.
It is good practice to install the package in the directory /usr/local, so that all users can run mkdom2/xdom2. However, this is not required: you may prefer to install the package in your home directory, which does not require root privileges.
You need a reasonably recent computer with enough memory. The program was tested on a Pentium 3 computer with 512 MB of memory (Linux) and on a 480 MHz Sun UltraSPARC machine (SunOS). The programs currently run only on Linux/Intel x86 or SunOS/SPARC systems. However, you may install the package in a directory shared by several machines of both architectures.
You must have perl installed, version 5.6.1 or later. No module other than those found in the standard perl distribution is required, except for the Df module written by Ian Guthrie. For convenience, this module is included in the mkdom2/xdom2 distribution, so you should not have to worry about it.
You should use
Unpacking the distribution and editing .login
Unpack the distribution with:
gunzip < xdom2.0-tar.gz | tar xvf -
This should create a directory called Xdom2.0, whose content looks like this:
ls -l Xdom2.0
drwxr-xr-x 4 manu prodom 4096 Dec 16 12:32 bin
drwxrwxr-x 3 manu prodom 4096 Dec 15 14:01 doc
drwxrwxr-x 5 manu prodom 4096 Dec 15 12:14 lib
-rw-rw-r-- 1 manu prodom 3658 Dec 17 15:53 mkdom2setup.pl
-rw-rw-r-- 1 manu prodom 1479 Dec 18 08:54 README
-rw-rw-r-- 1 manu prodom 229 Dec 17 16:36 setup.csh
drwxr-xr-x 2 manu prodom 4096 Dec 18 14:30 Test
Please run the script mkdom2_install.pl using the command:
perl Xdom2.0/mkdom2_install.pl
This will create a csh file called setup_SunOS.csh. You will have to source this file before executing mkdom2 or xdom2. You may also source the file called setup.csh, which automatically calls the right setup file for the machine's architecture.
It may be useful to add the following line at the end of your .login file:
cd Xdom2.0; source setup.csh; cd
Testing the program
Before starting, test the program to make sure that everything works. This can be done with the command:
mkdom2test.pl
This script:
- runs cfg.pl to configure a directory suitable for running mkdom2 with a test file as input.
- changes to this directory.
- calls mkdom2.
- calls the Unix command diff to check the differences between the obtained result and a reference file.
Please note that the results of mkdom2 differ from one architecture to another, due to differences in the implementation of the sort routines. We therefore provide a reference file for each supported architecture.
Should the two files differ, the program reports that the test did not succeed. In that case, have a look at the files Xdom2.0/Test/Test.51 and the locally generated Test_lcl.51 to investigate the problem.
Starting a new project
A whole domain analysis with mkdom2/xdom2 includes the following operations:
- Setting up an environment
- Checking the result files to detect possible problems
- Post-processing the data
- Viewing and possibly printing the data with the
Setting up an environment
Before starting the mkdom2 program, you have to create a working environment (i.e. some directories and files that will be used by the programs). This is done by executing the script mkdom2cfg.pl. You will have to answer some questions:
- a name for the project
- a version number (any string will be OK): default is a string containing today's date.
- Do you want to use some expert domains at the beginning of the clustering process ? (see later).
- The fasta-formatted input file name.
Let's say your project is called organism and the version number is 20031225: a directory called organism-20031225 is then created, with some files and directories inside, as shown below:
$ ls -l organism-20031225
drwxrwxr-x 2 manu prodom 4096 Dec 11 15:37 checkpoint
drwxrwxr-x 2 manu prodom 4096 Dec 11 15:37 data
-rw-rw-r-- 1 manu prodom 48 Dec 11 15:37 mkdom2.conf
Starting the mkdom2 script
You have to change directory to
mkdom2 may be started with the command:
mkdom2 IN=organism.fasta LOG=mkdom2.log &
The LOG switch is not required; however, if it is not specified, mkdom2 logs to the standard output, which may be inconvenient when execution lasts a long time (typically several hours or days for big fasta files).
The time stamps
From time to time, and especially for each blastpgp execution, a time stamp is calculated and formed as follows:
#03#12#09#07#27#54#11#
This stamp is the coded value of a date, here December 9th, 2003 at 7:27:54. Stamps may be generated at a relatively high rate, and the last number (11) makes sure that the stamps are different, even if less than a second has elapsed. These stamps are used to check the synchronization between the many files created.
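As an illustration, the stamp layout described above can be decoded with a few lines of Python. This is only a sketch, not part of the mkdom2 distribution: the names `Stamp` and `parse_stamp` are hypothetical, and the code assumes that two-digit years belong to the 2000s, as in the example.

```python
from datetime import datetime
from typing import NamedTuple


class Stamp(NamedTuple):
    when: datetime  # date and time encoded in the first six fields
    serial: int     # last field, telling apart stamps made within one second


def parse_stamp(text: str) -> Stamp:
    """Decode a stamp such as '#03#12#09#07#27#54#11#'."""
    fields = text.strip().strip("#").split("#")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {text!r}")
    yy, month, day, hour, minute, second, serial = (int(f) for f in fields)
    # Assumption: the two-digit year is counted from 2000 (03 means 2003).
    return Stamp(datetime(2000 + yy, month, day, hour, minute, second), serial)


stamp = parse_stamp("#03#12#09#07#27#54#11#")
print(stamp.when.isoformat(), stamp.serial)  # 2003-12-09T07:27:54 11
```

Because a `Stamp` is a tuple of (date, serial number), two stamps can be compared directly with `<` to find out which one is more recent.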
Checkpointing the data during the execution
The data are checkpointed from time to time so that, in case of an unexpected interruption, as little data as possible is lost. The important temporary files are automatically copied to the directory checkpoint/<stamp>, where <stamp> is the stamp generated at the moment of the checkpoint. However, only 2 subdirectories are kept under the checkpoint directory, in order to avoid filling the disk.
Interrupting mkdom2 in an orderly manner
mkdom2 may run for a very long time, depending on the data. It may thus be useful to be able to interrupt the program without losing the work already done. This can be done by creating an empty file called
The program checks for the existence of this file from time to time, so it may keep running for a few minutes before stopping. The data are then checkpointed and saved in the directory called
current (this is in fact a symbolic link to a directory named
Retrieving data after an unpredicted interruption
Should the program be interrupted unexpectedly (after a power failure, a system crash, etc.), you need to retrieve the last checkpointed files before resuming the operation: identify (using the time stamp) the most recent subdirectory in the checkpoint directory, then change to this directory and type the following command:
cp * ../../data/current
This copies every file found in this directory to the current results directory for later reference. However, please note that the computations performed between this checkpoint and the time of interruption will be lost.
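The manual retrieval described above can be sketched in Python as follows. This is a hedged illustration, not a tool shipped with mkdom2: the function names are hypothetical, and the code assumes that each subdirectory of checkpoint is named exactly after its stamp and that data/current already exists in the project directory, as the cp command above suggests.

```python
import shutil
from pathlib import Path


def stamp_key(name: str) -> tuple:
    """Turn '#03#12#09#07#27#54#11#' into (3, 12, 9, 7, 27, 54, 11)."""
    return tuple(int(field) for field in name.strip("#").split("#"))


def latest_checkpoint(checkpoint_dir: Path) -> Path:
    """Return the subdirectory carrying the most recent stamp."""
    candidates = [d for d in checkpoint_dir.iterdir() if d.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no checkpoint found in {checkpoint_dir}")
    # Assumption: every subdirectory name is a stamp; otherwise int() fails.
    return max(candidates, key=lambda d: stamp_key(d.name))


def restore(project_dir: Path) -> Path:
    """Copy the files of the latest checkpoint into data/current."""
    source = latest_checkpoint(project_dir / "checkpoint")
    target = project_dir / "data" / "current"
    for item in source.iterdir():
        if item.is_file():
            shutil.copy2(item, target / item.name)
    return source
```

Calling `restore(Path("organism-20031225"))` would then play the role of the cp command; comparing the stamps field by field avoids relying on file modification times, which a crash may leave unreliable.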
Resuming the execution
Resuming the process after an orderly interruption, or after an unexpected interruption followed by a successful retrieval of the data, is an easy task:
cp current/organism.fasta.SL .
mkdom2 IN=organism.fasta.SL LOG=mkdom2.log
Please note that the input file is now the file organism.fasta.SL, that is, the original fasta file sorted by sequence length and purged of the domains already found.
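To make the idea of a length-sorted fasta file concrete, here is a minimal Python sketch that reads fasta records and orders them from shortest to longest. It only illustrates the sorting; it does not reproduce how mkdom2 builds organism.fasta.SL, and in particular it does not remove domains that were already found. The function names and the toy sequences are invented for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            # Sequence text found before the first header is discarded
            # when that header is reached.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records: Iterable[Record]) -> List[Record]:
    """Shortest sequence first; ties keep their input order."""
    return sorted(records, key=lambda record: len(record[1]))


# Toy data, invented for this example.
EXAMPLE = """>seqA
MKTAYIAKQR
QISFVKSHFS
>seqB
MKV
>seqC
MKTAYI
"""

records = list(read_fasta(EXAMPLE.splitlines()))
print([header for header, _ in sort_by_length(records)])
```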
Looking at the log and result files
The process may be monitored while mkdom2 is running, mainly by looking at the clustering log file and at the temporary results file, respectively called
Mkdom2.tmp.prodom.51. The following shows some lines from the clustering log file: a few families are generated, then from time to time the database is reorganized (some domains are taken out of the database, the database is sorted again, and the utility formatdb is run). When this occurs, the program checks the remaining disk space, because a full disk could lead to incorrect results and data loss. Should the free disk space drop too low, the program is gently interrupted; you should then remove some files to recover disk space, then resume the program.
#03#12#11#15#45#49#00# _M_ PSIBLAST OK - FAM 477
#03#12#11#15#45#49#00# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#49#00# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37137
#03#12#11#15#45#53#00# _M_ PSIBLAST OK - UNIQ 478
#03#12#11#15#45#53#01# _M_ PSIBLAST OK - UNIQ 479
#03#12#11#15#45#54#00# _M_ PSIBLAST OK - UNIQ 480
#03#12#11#15#45#54#01# _M_ PSIBLAST OK - UNIQ 481
#03#12#11#15#45#54#02# _M_ PSIBLAST OK - FAM 482
#03#12#11#15#45#54#02# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#54#02# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37140
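When a run lasts for days, it can be handy to summarize the clustering log rather than read it line by line. The Python sketch below assumes that the log lines follow exactly the layout of the excerpt above; the regular expressions and the `summarize` function are hypothetical helpers, not part of mkdom2. Judging from the excerpt, FAM seems to mark clusters with several members and UNIQ clusters with a single member, but this reading is an assumption.

```python
import re
from collections import Counter

STAMP = r"#(?:\d{2}#){7}"
CLUSTER_RE = re.compile(rf"^({STAMP})\s+_M_\s+PSIBLAST OK - (FAM|UNIQ) (\d+)$")
DISK_RE = re.compile(
    rf"^({STAMP})\s+_M_\s+DISK SPACE \(Ko\) = (\d+) - NEEDED = (\d+)$"
)


def summarize(lines):
    """Count FAM/UNIQ lines and track the smallest disk margin reported."""
    kinds = Counter()
    last_cluster = None
    min_margin = None
    for raw in lines:
        line = raw.strip()
        match = CLUSTER_RE.match(line)
        if match:
            kinds[match.group(2)] += 1
            last_cluster = int(match.group(3))
            continue
        match = DISK_RE.match(line)
        if match:
            # Margin = available space minus needed space, same unit as the log.
            margin = int(match.group(2)) - int(match.group(3))
            min_margin = margin if min_margin is None else min(min_margin, margin)
    return {"kinds": kinds, "last_cluster": last_cluster, "min_margin": min_margin}


EXAMPLE_LOG = """\
#03#12#11#15#45#49#00# _M_ PSIBLAST OK - FAM 477
#03#12#11#15#45#49#00# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#49#00# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37137
#03#12#11#15#45#53#00# _M_ PSIBLAST OK - UNIQ 478
#03#12#11#15#45#53#01# _M_ PSIBLAST OK - UNIQ 479
#03#12#11#15#45#54#00# _M_ PSIBLAST OK - UNIQ 480
#03#12#11#15#45#54#01# _M_ PSIBLAST OK - UNIQ 481
#03#12#11#15#45#54#02# _M_ PSIBLAST OK - FAM 482
#03#12#11#15#45#54#02# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#54#02# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37140
"""

summary = summarize(EXAMPLE_LOG.splitlines())
print(dict(summary["kinds"]), summary["last_cluster"], summary["min_margin"])
```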
The following shows the corresponding lines, extracted from the intermediate results file:
Cluster #477: ---------------------------------------------
// STAMP #03#12#11#15#45#49#00#
// QUERY GSTEN:00003660:P:001#1#32
Set # 477:
[GSTEN:00003660:P:001 1 32
[GSTEN:00021931:P:001 22 54
Cluster #478: ---------------------------------------------
// STAMP #03#12#11#15#45#53#00#
// QUERY GSTEN:00021931:P:001#1#21
Set # 478:
[GSTEN:00021931:P:001 1 21 S=115
Cluster #479: ---------------------------------------------
// STAMP #03#12#11#15#45#53#01#
// QUERY GSTEN:00005831:P:001#1#32
Set # 479:
[GSTEN:00005831:P:001 1 32 S=163
Cluster #480: ---------------------------------------------
// STAMP #03#12#11#15#45#54#00#
// QUERY GSTEN:00005952:P:001#1#32
Set # 480:
[GSTEN:00005952:P:001 1 32 S=200
Cluster #481: ---------------------------------------------
// STAMP #03#12#11#15#45#54#01#
// QUERY GSTEN:00006207:P:001#1#32
Set # 481:
[GSTEN:00006207:P:001 1 32 S=182
Cluster #482: ---------------------------------------------
// STAMP #03#12#11#15#45#54#02#
// QUERY GSTEN:00006690:P:001#1#32
Set # 482:
[GSTEN:00006690:P:001 1 32
[GSTEN:00010495:P:001 44 75
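The intermediate results file can be read back into a simple data structure, for instance to list the members of each cluster or to compare the stamps with those of the log file. The following Python sketch assumes that the file follows the layout of the excerpt above; `Cluster` and `parse_clusters` are hypothetical names, and the optional `S=` field, whose meaning is not described here, is simply ignored.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CLUSTER_RE = re.compile(r"^Cluster #(\d+):")
STAMP_RE = re.compile(r"^// STAMP (#(?:\d{2}#){7})$")
QUERY_RE = re.compile(r"^// QUERY (\S+)$")
MEMBER_RE = re.compile(r"^\[(\S+)\s+(\d+)\s+(\d+)(?:\s+S=\d+)?$")


@dataclass
class Cluster:
    number: int
    stamp: Optional[str] = None
    query: Optional[str] = None
    # Each member is (sequence identifier, start, end).
    members: List[Tuple[str, int, int]] = field(default_factory=list)


def parse_clusters(lines) -> List[Cluster]:
    """Group STAMP, QUERY and member lines under their 'Cluster #n:' line."""
    clusters: List[Cluster] = []
    for raw in lines:
        line = raw.strip()
        match = CLUSTER_RE.match(line)
        if match:
            clusters.append(Cluster(int(match.group(1))))
            continue
        if not clusters:
            continue  # ignore anything before the first cluster
        current = clusters[-1]
        match = STAMP_RE.match(line)
        if match:
            current.stamp = match.group(1)
            continue
        match = QUERY_RE.match(line)
        if match:
            current.query = match.group(1)
            continue
        match = MEMBER_RE.match(line)
        if match:
            current.members.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return clusters


EXAMPLE_RESULTS = """\
Cluster #477: ---------------------------------------------
// STAMP #03#12#11#15#45#49#00#
// QUERY GSTEN:00003660:P:001#1#32
Set # 477:
[GSTEN:00003660:P:001 1 32
[GSTEN:00021931:P:001 22 54
Cluster #478: ---------------------------------------------
// STAMP #03#12#11#15#45#53#00#
// QUERY GSTEN:00021931:P:001#1#21
Set # 478:
[GSTEN:00021931:P:001 1 21 S=115
"""

clusters = parse_clusters(EXAMPLE_RESULTS.splitlines())
for cluster in clusters:
    print(cluster.number, cluster.stamp, len(cluster.members))
```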
Quality checks
To verify the consistency of the log files, the mkdom2ck.pl script tests for synchronization problems which may cause data loss or data corruption: this could happen in particular if the program is interrupted during the process and incorrectly resumed.
mkdom2ck.pl performs the following checks:
- Are the result directories 1 2 3... covering the whole process, without overlap and without any gap?
- Are the log files and the result files in each directory 1 2 3... synchronized? (The time stamps are used for this purpose.)
- For each result directory, look for sequences which were withdrawn from the database: these sequences must be found in the result file.
- The last check tries to find each sequence of the source database in one of the result files. Please note that, in case of interruption, the previous checks make sense even if the whole process is not completed. This last check, however, makes sense only after the whole process is completed. To check an incomplete process, you may skip it by calling mkdom2ck.pl with the switch --no_db_check.
The check may be done with:
mkdom2ck.pl --db organism.fasta [--no_db_check]
Postprocessing and data analysis
The data must now be postprocessed, which implies:
- Concatenation of all the result files (they are for now split between directories 1 2 3...).
- Transformation of those files to a standard ProDom file and to an xdom format file.
- Computation of the multiple alignments of domains in each family (this rather long step may be skipped).
- Creation of a project file ready to be read by the
The postprocessing may be done with:
mkdom2pp.pl --db organism.fasta [--no_alignment]
Viewing the data with the xdom program
You can now admire, print, think about your data with the
Citing the mkdom2 program
Should you use this program for a publication, please cite the following reference: Gouzy J., Corpet F. & Kahn D. (1999). Whole genome protein domain analysis using a new method for domain clustering, Computers and Chemistry. 23:333-340.