Readme for NCBI blast ftp site
                       Last updated on February 15, 2004

This file lists the subdirectories and files found on the NCBI BLAST 
ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/).  It provides the basic 
information on file content, and on how the files should be used. 


1. Introduction

The NCBI BLAST ftp site provides standalone blast, client-server blast,
and wwwblast packages for different platforms.  It also provides
commonly used blast databases in both preformatted and FASTA formats.
Documentation on the blast executables and other related subjects is
also provided.


2. File list and content

Descriptions of the files are listed in the tables below, one table
for each directory or subdirectory.
 
2.1 ftp://ftp.ncbi.nlm.nih.gov/blast/ directory content

The blast ftp directory contains several subdirectories, each holding
a specific set of files.

+------------------+-------------------------------------------------+
|Name              |Content                                          |
+------------------+-------------------------------------------------+
blastftp.txt        this file

db                  subdirectory with databases, in preformatted or
                      FASTA form

demo                demonstration programs and documents from blast
                      developers

documents           documentation for the programs in the standalone
                      blast, netblast, and wwwblast packages

executables         archives for binary distribution of blast programs

matrices            protein and nucleotide score matrices; only a
                      subset is supported by blast

temp                temporary directory for miscellaneous files
+------------------+-------------------------------------------------+


2.2 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/db/ subdirectory

Databases larger than two gigabytes (2 GB) are formatted in multiple
volumes, which are named using the "database.##.tar.gz" convention.
All volumes of a database are required. An alias file (.nal or .pal) is
provided so that the database can be called by its alias name, without
the extension. For example, to search the est database, simply use the
"-d est" option on the command line (without the quotes).
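
As a rough sketch of the "all volumes are required" rule, the shell
function below reports which volume archives of a database are still
missing from a download directory. The name missing_volumes and its
arguments are illustrative only and are not part of any BLAST package;
the volume count has to be read from the file table.

```shell
# List the volume archives of a multi-volume database that are missing
# from a directory. Hypothetical helper, not shipped with BLAST.
# Usage: missing_volumes NAME COUNT [DIR]
#   NAME   database name, for example est
#   COUNT  number of volumes listed for that database
#   DIR    directory holding the downloads (default: current directory)
missing_volumes() {
    name=$1
    count=$2
    dir=${3:-.}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Volumes are numbered from 00: NAME.00.tar.gz, NAME.01.tar.gz, ...
        archive=$(printf '%s.%02d.tar.gz' "$name" "$i")
        [ -f "$dir/$archive" ] || printf '%s\n' "$archive"
        i=$((i + 1))
    done
}
```

For example, "missing_volumes est 3" prints nothing once est.00.tar.gz,
est.01.tar.gz, and est.02.tar.gz are all present. After the volumes are
unpacked, a search could then name the alias, for instance (query.fasta
and est_hits.txt are placeholder file names):

        blastall -p blastn -d est -i query.fasta -o est_hits.txt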

Certain databases are subsets of a larger parent database. For those
databases, mask files, rather than actual databases, are provided. A
mask file needs its parent database to function properly, and the
parent database should be generated on the same day as the mask file.
For example, to use the preformatted swissprot database,
swissprot.tar.gz, one will need to get the nr.tar.gz with the same
date stamp.

To use a preformatted blast database file, first inflate the file
using gzip (unix, linux), WinZip (Windows), or StuffIt Expander (Mac),
then extract the component files from the resulting tar file using
tar (unix, linux), WinZip (Windows), or StuffIt Expander (Mac). The
resulting files are ready for BLAST.
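
On unix or linux the two steps can be combined in a pipe. The function
below is a minimal sketch that assumes gzip and tar are installed; the
name unpack_db is made up for this example.

```shell
# Inflate a preformatted database archive and extract its component
# files into a target directory. Hypothetical helper for illustration.
# Usage: unpack_db ARCHIVE.tar.gz [DIR]
unpack_db() {
    archive=$1
    dir=${2:-.}
    mkdir -p "$dir" || return 1
    # gzip -dc writes the inflated tar stream to stdout; tar reads it
    # from stdin ("-") after changing into the target directory.
    gzip -dc "$archive" | (cd "$dir" && tar -xf -)
}
```

A multi-volume database can be unpacked with a loop, for example
(blastdb is a placeholder directory name):

        for f in est.*.tar.gz; do unpack_db "$f" blastdb; done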

+---------------------+----------------------------------------------+
|Name                 |Content                                       |
+---------------------+----------------------------------------------+
FASTA                  subdirectory with databases in FASTA format

blastdb.txt            content list of the blast database

est.00.tar.gz          first volume of the est database
est.01.tar.gz          second volume of the est database
est.02.tar.gz          third volume of the est database
                       all volumes are needed to reconstitute 
                         complete est database 

est_human.tar.gz       human est database, a mask file that requires
                         all volumes of est to work

est_mouse.tar.gz       mouse est database, a mask file that requires
                         all volumes of est to work

est_others.tar.gz      est database without human/mouse entries, a
                         mask file that requires all volumes of est

gss.tar.gz             genomic survey sequence database

htgs.00.tar.gz         first volume of the htgs database
htgs.01.tar.gz         second volume of the htgs database
htgs.02.tar.gz         third volume of the htgs database
htgs.03.tar.gz         fourth volume of the htgs database
                       all volumes are needed to reconstitute
                         complete htgs database

human_genomic.tar.gz   human chromosome database containing 
                         concatenated contigs with adjusted gaps 
                         represented by N's

nr.tar.gz              non-redundant protein database

nt.00.tar.gz           first volume of the nucleotide nr database
nt.01.tar.gz           second volume of the nucleotide nr database
nt.02.tar.gz           third volume of the nucleotide nr database
                       all volumes are needed to reconstitute
                         complete nt database

other_genomic.tar.gz   chromosome database for organisms other than 
                         human

pataa.tar.gz           patent protein database

patnt.tar.gz           patent nucleotide database

pdbaa.tar.gz           protein sequence database for pdb entries. It
                         is a mask file and requires nr.tar.gz

pdbnt.tar.gz           nucleotide sequence database for pdb entries.
                         They are not coding sequences for the
                         corresponding protein structure entries!

sts.tar.gz             sequence tagged sites database

swissprot.tar.gz       swissprot sequence database, last major
                         release. It is a mask file and requires
                         nr.tar.gz to work properly

taxdb.tar.gz           taxonomy id database for use with new version 
                         of blast database (not fully implemented yet)

wgs.00.tar.gz          first volume of the wgs assembly database
wgs.01.tar.gz          second volume of the wgs assembly database
wgs.02.tar.gz          third volume of the wgs assembly database
wgs.03.tar.gz          fourth volume of the wgs assembly database
wgs.04.tar.gz          fifth volume of the wgs assembly database
wgs.05.tar.gz          sixth volume of the wgs assembly database
                         all volumes are needed
+--------------------+-----------------------------------------------+


2.2.1 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA 
subdirectory

The FASTA database files are now stored in this subdirectory, which
also contains some additional databases that are not available via the
NCBI BLAST pages. Due to file size issues, the full est database is not
provided. One needs to get the three subsets and concatenate them
together to get the complete est database.
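
One way to do the concatenation on unix or linux is sketched below. It
assumes the three subset files (est_human.gz, est_mouse.gz, and
est_others.gz) have already been downloaded; the function name
build_est and the default output name est are choices made for this
example.

```shell
# Decompress the three est subsets and join them into one FASTA file.
# Hypothetical helper for illustration.
# Usage: build_est [DIR] [OUTPUT]
#   DIR     directory holding the three .gz files (default: .)
#   OUTPUT  name of the combined FASTA file (default: est)
build_est() {
    dir=${1:-.}
    out=${2:-est}
    # gzip -dc inflates each file in turn and writes the results,
    # back to back, to stdout.
    gzip -dc "$dir/est_human.gz" "$dir/est_mouse.gz" \
             "$dir/est_others.gz" > "$out"
}
```

The combined file is plain FASTA and still has to be formatted with
formatdb before it can be searched.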

These databases will need to be formatted using the formatdb program
found in the standalone blast executable package.  The recommended
command lines to use are:

        formatdb -i input_db -p F -o T          for nucleotide

        formatdb -i input_db -p T -o T          for protein

For additional information on formatdb, please see the formatdb.txt
document under the /blast/documents/ directory.
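
The only difference between the two recommended command lines is the
value given to -p. The small helper below, whose name formatdb_command
is invented for this example, prints the matching command line for a
given input file so that it can be checked before it is run. It is a
sketch only and does not call formatdb itself.

```shell
# Print the recommended formatdb command line for a FASTA file.
# Hypothetical helper; it only prints, it does not run formatdb.
# Usage: formatdb_command INPUT TYPE
#   TYPE is "protein" (-p T) or "nucleotide" (-p F)
formatdb_command() {
    input=$1
    case $2 in
        protein)    flag=T ;;
        nucleotide) flag=F ;;
        *)
            echo "unknown sequence type: $2" >&2
            return 1
            ;;
    esac
    printf 'formatdb -i %s -p %s -o T\n' "$input" "$flag"
}
```

For instance, "formatdb_command ecoli.nt nucleotide" prints the
nucleotide form of the command for the inflated ecoli.nt file.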

+------------------+--------------------------------------------------+
|Name              |Content                                           |
+------------------+--------------------------------------------------+
 alu.a.gz             proteins translated from alu.n 

 alu.n.gz             alu repeat sequences
 
 drosoph.aa.gz        Drosophila protein from genome annotation 

 drosoph.nt.gz        Drosophila genome

 ecoli.aa.gz          E.coli K-12 proteins from genome annotation 

 ecoli.nt.gz          E.coli K-12 genomic contigs

 est_human.gz         human subset of the est database

 est_mouse.gz         mouse subset of the est database

 est_others.gz        subset of est other than human or mouse entries

 gss.gz               Genomic Survey Sequences (mostly BAC ends) 

 htgs.gz              High Throughput Genomic Sequences

 human_genomic.gz     Human chromosomes formed by concatenating genomic 
                        contig assemblies (NT_######) and adjusting the 
                        gaps with N's

 igSeqNt.gz           Immunoglobulin nucleotide sequences

 igSeqProt.gz         Immunoglobulin protein sequences

 mito.aa.gz           protein from the annotated mitochondrial genomes

 mito.nt.gz           mitochondrial genomes

 month.aa.gz          protein sequences released or updated in the past
                        30 days

 month.est_human.gz   human subset of EST released/updated in the past 
                        30 days

 month.est_mouse.gz   mouse subset of EST released/updated in the past
                        30 days

 month.est_others.gz  EST, without entries from human or mouse, released
                        or updated in the past 30 days

 month.gss.gz         gss entries released/updated in the past 30 days 

 month.htgs.gz        htgs entries released/updated in the past 30 days

 month.nt.gz          subset of nt released/updated in the past 30 days

 nr.gz                non-redundant protein sequence database

 nt.gz                nucleotide database from GenBank, excluding the
                        htgs, est, gss, sts, and pat batch divisions
                        and wgs entries.  Not non-redundant.

 other_genomic.gz     Chromosome entries other than human

 pataa.gz             Patent protein sequence database 

 patnt.gz             Patent nucleotide sequence database 

 pdbaa.gz             protein sequences for pdb entries

 pdbnt.gz             nucleotide entries for pdb entries.  They are NOT
                        the coding sequences for the corresponding
                        protein entries

 sts.gz               Sequence Tagged Sites database

 swissprot.gz         swissprot database, last major release

 vector.gz            vector sequences from synthetic (syn) division
                        of GenBank

 wgs.gz               Whole Genome Shotgun sequence assembly

 yeast.aa.gz          protein translations from yeast genome annotation 

 yeast.nt.gz          yeast genomic sequence
+------------------+----------------------------------------------------+


2.3 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/demo/ directory

This directory contains some technical presentations from the BLAST 
developers along with some demo tools or documentation relevant to BLAST.

+------------------------+-----------------------------------------------+
|Name                    |Content                                        |
+------------------------+-----------------------------------------------+
 README.blast_demo         readme for blast_demo package

 README.first              readme for this directory

 README.parse_blast_xml    readme for parse_blast_xml package

 blast_demo.tar.gz         blast_demo package on blast db, blast object,
                             and reformatting blast alignment from
                             blastobj file

 blast_exercises.doc       blast exercise questions and answers

 blast_programming.ppt     PowerPoint presentation on BLAST programming

 blast_talk.ppt            PowerPoint presentation (O'Reilly conference)

 ieee_blast.final.ppt      PowerPoint presentation (IEEE conference)

 ieee_talk.pdf             Above IEEE presentation in PDF format

 parse_blast_xml.tar.gz    demo package on parsing xml styled blast output

 splitd.ppt                PowerPoint presentation on NCBI BLAST server's
                             splitd implementation

 test_suite.tar.gz         test package
+------------------------+-----------------------------------------------+


2.4 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/documents/ directory

This directory contains copies of the documentation on different BLAST 
programs distributed from this ftp site under the /blast/executables/ 
directory. blast.txt also contains detailed release history.

+------------------------+-----------------------------------------------+
|Name                    |Content                                        |
+------------------------+-----------------------------------------------+
 blast.txt                 readme for blastall and blastpgp

 blastclust.txt            readme for blastclust

 developer                 subdirectory with additional documentation

 blast_seqalign.txt        description of the seqalign function

 readdb.txt                description of the readdb function

 urlapi.txt                a short introduction to the BLAST URL API,
                             which supersedes blasturl

 formatdb.txt              readme for formatdb program

 impala.txt                readme for impala

 megablast.txt             readme for megablast

 netblast.txt              readme for netblast (blastcl3)

 rpsblast.txt              readme for rpsblast

 xml                       subdirectory with .dtd and .mod field 
                             description files for blast xml output

 xml/NCBI_BlastOutput.dtd  dtd file for blast xml output
 xml/NCBI_BlastOutput.mod  mod file for blast xml output
 xml/NCBI_Entity.mod       mod file for NCBI xml file
 xml/README.blxml          readme on blast xml output
+------------------------+-----------------------------------------------+


2.5 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/executables/ 
directory

This directory contains several subdirectories, each for a specific
subset of the executable BLAST programs:

/LATEST-BLAST subdirectory contains the standalone blast binaries from
        the latest major versioned release.

/LATEST-NETBLAST subdirectory contains the netblast binaries from the
        latest major versioned release.

/LATEST-WWWBLAST subdirectory contains the wwwblast binaries from the
        latest major versioned release.

/release subdirectory contains the different releases, with the latest
        one linked from the LATEST directories.

/snapshot subdirectory contains patches or intermediate updates put up in
        between major releases. For previous releases, go to the release
        subdirectory, where the old major releases are archived back to
        version 2.0.10.



2.5.1 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST-BLAST,
 /LATEST-NETBLAST, and /LATEST-WWWBLAST subdirectories

All three of these subdirectories link to the latest release directory,
which contains the standalone BLAST executables package (archives whose
names begin with blast), the blastcl3 client (archives whose names begin
with netblast), and server blast (archives whose names begin with
wwwblast).

The standalone archive is needed to set up BLAST locally on the user's own
machine. It also provides the tools necessary to prepare custom databases
and retrieve sequences from these prepared databases.  Different archives
for commonly used platforms are available.

The blast client archive contains the blastcl3 program, which formulates
the BLAST search locally and forwards it to the NCBI BLAST server for
processing. The search results returned by the NCBI BLAST server are saved
to a user-specified file on the local disk.

The server blast archive contains web pages with embedded blast search
forms similar to those of NCBI. These pages can process BLAST search
requests against a local set of databases and return the results to a
browser window. wwwblast is now in sync with the NCBI toolkit and the
two packages above.


+------------------------------------+-------------------------------+
|Name                                |Content                        |
+------------------------------------+-------------------------------+
 MD5SUM.txt

 blast-2.2.8-alpha-osf1.tar.gz        Standalone for COMPAQ/HP alpha 
                                       machine (OSF 5.1 and above)

 blast-2.2.8-amd64-linux.tar.gz       Standalone for AMD 64-bit PC
                                        running Linux

 blast-2.2.8-ia32-freebsd.tar.gz      Standalone for Intel Pentium PC
                                        running FreeBSD

 blast-2.2.8-ia32-linux.tar.gz        Standalone for Intel Pentium PC
                                        running Linux

 blast-2.2.8-ia32-win32.exe           Standalone for Intel Pentium PC
                                        running Windows

 blast-2.2.8-ia64-linux.tar.gz        Standalone for Intel Itanium PC
                                        running Linux

 blast-2.2.8-mips-irix-32-bit.tar.gz  Standalone for 32-bit SGI

 blast-2.2.8-mips-irix.tar.gz         Standalone for 64-bit SGI

 blast-2.2.8-powerpc-macosx.tar.gz    Standalone for MacOSX (terminal)

 blast-2.2.8-sparc-solaris.tar.gz     Standalone for Sun Sparc station 
                                         running Solaris 

 netblast-2.2.8-alpha-osf1.tar.gz     netblast for COMPAQ/HP alpha
                                        machine (OSF 5.1 and above)

 netblast-2.2.8-amd64-linux.tar.gz    netblast for AMD 64-bit PC
                                        running Linux

 netblast-2.2.8-ia32-freebsd.tar.gz   netblast for Intel Pentium PC
                                        running FreeBSD

 netblast-2.2.8-ia32-linux.tar.gz     netblast for Intel Pentium PC
                                        running Linux

 netblast-2.2.8-ia32-win32.exe        netblast for Intel Pentium PC
                                        running Windows

 netblast-2.2.8-ia64-linux.tar.gz     netblast for Intel Itanium PC
                                        running Linux

 netblast-2.2.8-mips-irix.tar.gz      netblast for SGI 32-bit system

 netblast-2.2.8-powerpc-macosx.tar.gz netblast for MacOSX

 netblast-2.2.8-sparc-solaris.tar.gz  netblast for Sun Sparc station
                                        running Solaris 

 wwwblast-2.2.8-alpha-osf1.tar.gz     wwwblast for COMPAQ/HP alpha
                                        machine (OSF 5.1 and above)

 wwwblast-2.2.8-amd64-linux.tar.gz    wwwblast for AMD 64-bit PC
                                        running Linux

 wwwblast-2.2.8-ia32-freebsd.tar.gz   wwwblast for Intel Pentium PC
                                        running FreeBSD

 wwwblast-2.2.8-ia32-linux.tar.gz     wwwblast for Intel Pentium PC
                                        running Linux

 wwwblast-2.2.8-ia64-linux.tar.gz     wwwblast for Intel Itanium PC
                                        running Linux

 wwwblast-2.2.8-mips-irix.tar.gz      wwwblast for SGI 32-bit system

 wwwblast-2.2.8-powerpc-macosx.tar.gz wwwblast for MacOSX

 wwwblast-2.2.8-sparc-solaris.tar.gz  wwwblast for Sun Sparc station 
                                        running Solaris 
+------------------------------------+-------------------------------+


2.5.2 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release 
subdirectory

This directory contains past major releases of BLAST, as far back as 
version 2.0.10. Each release is in its own subdirectory. 


2.5.3 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/executables/snapshot
subdirectory

This subdirectory contains intermediate enhanced or patched archives
released after the last major release.  They are organized by date and
contain only the binaries for the affected platforms.


2.5.4 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/executables/special
subdirectory

From time to time, we make binaries for some rare platforms under 
special circumstances.  Those files are archived here.


2.6 File content for ftp://ftp.ncbi.nlm.nih.gov/blast/matrices directory

This directory contains the scoring matrices that BLAST can use to
score alignments.  The files are plain text in a special format and
can be viewed directly in a browser.

For valid statistical analysis, blastn uses only an identity matrix, and
blastp supports only a limited subset of the BLOSUM and PAM matrices:
BLOSUM45, BLOSUM62, and BLOSUM80, plus PAM30 and PAM70.
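
A script that lets the user pick a matrix could check the name against
this list before starting a protein search. The function below is a
minimal sketch of such a check; the name is_supported_matrix is invented
here, and the list is copied from the paragraph above rather than
queried from BLAST.

```shell
# Succeed (exit status 0) only for the protein matrices listed above
# as supported by blastp. Hypothetical helper for illustration.
# Usage: is_supported_matrix NAME
is_supported_matrix() {
    case $1 in
        BLOSUM45|BLOSUM62|BLOSUM80|PAM30|PAM70) return 0 ;;
        *) return 1 ;;
    esac
}
```

With blastall the matrix is chosen with the -M option, so a guarded
call might look like the following (query.fasta is a placeholder, and
suitable gap costs may also need to be set; see blast.txt):

        is_supported_matrix BLOSUM80 &&
            blastall -p blastp -d nr -i query.fasta -M BLOSUM80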


2.7 File content of the ftp://ftp.ncbi.nlm.nih.gov/blast/temp 
subdirectory

A leftover subdirectory of miscellaneous files or tools.


3. Technical Support

Additional questions/comments on this ftp site should be directed to 
NCBI blast-help group at:
        blast-help@ncbi.nlm.nih.gov

Other questions on general NCBI resources should be directed to:
        info@ncbi.nlm.nih.gov