EMBOSS: skipredundant

skipredundant

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Remove redundant sequences from an input set

Description

Redundancy in a database or other collection of sequences occurs when one or more similar sequences are present. The inclusion of very similar sequences in certain analyses will introduce undesirable bias. For example, a family may possess 100 sequences in the sequence database, but 90 of these might be essentially the same sequence, e.g. very close relatives or mutations of a single sequence. Although 100 sequences are known, the family only contains 11 sequences that are essentially unique. For many applications it is desirable or even essential to remove redundant sequences from a set in order to produce a smaller set that is representative of the whole. SEQNR removes redundancy from an input file of sequences, either at a single threshold of sequence similiarty (e.g. 40%) or within a threshold range of sequence similiarty (e.g. 40% - 70%).

Algorithm

Redundancy is calculated for each pair of sequences in turn.

Usage

Here is a sample session with skipredundant

% skipredundant -threshold 80 -redundant "" Remove redundant sequences from an input set Input sequence set: globins.fasta Redundancy removal options 1 : Single threshold percentage sequence similarity 2 : Outside a range of acceptable threshold percentage similarities Select number [1]: Gap opening penalty [10.0]: Gap extension penalty [0.5]: output sequence(s) [globins.keep]:

Go to the input files for this example
Go to the output files for this example

Command line arguments

Remove redundant sequences from an input set
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-sequences]         seqset     Sequence set filename and optional format,
                                  or reference (input USA)
   -mode               menu       [1] This option specifies whether to remove
                                  redundancy at a single threshold percentage
                                  sequence similarity or remove redundancy
                                  outside a range of acceptable threshold
                                  percentage similarity. All permutations of
                                  pair-wise sequence alignments are calculated
                                  for each set of input sequences in turn
                                  using the EMBOSS implementation of the
                                  Needleman and Wunsch global alignment
                                  algorithm. Redundant sequences are removed
                                  in one of two modes as follows: (i) If a
                                  pair of proteins achieve greater than a
                                  threshold percentage sequence similarity
                                  (specified by the user) the shortest
                                  sequence is discarded. (ii) If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range (specified by the user) the shortest
                                  sequence is discarded. (Values: 1 (Single
                                  threshold percentage sequence similarity); 2
                                  (Outside a range of acceptable threshold
                                  percentage similarities))
*  -threshold          float      [95.0] This option specifies the percentage
                                  sequence identity redundancy threshold. The
                                  percentage sequence identity redundancy
                                  threshold determines the redundancy
                                  calculation. If a pair of proteins achieve
                                  greater than this threshold the shortest
                                  sequence is discarded. (Any numeric value)
*  -minthreshold       float      [30.0] This option specifies the percentage
                                  sequence identity redundancy threshold
                                  (lower limit). The percentage sequence
                                  identity redundancy threshold determines the
                                  redundancy calculation. If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range the shortest sequence is discarded.
                                  (Any numeric value)
*  -maxthreshold       float      [90.0] This option specifies the percentage
                                  sequence identity redundancy threshold
                                  (upper limit). The percentage sequence
                                  identity redundancy threshold determines the
                                  redundancy calculation. If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range the shortest sequence is discarded.
                                  (Any numeric value)
   -gapopen            float      [10.0 for any sequence] The gap open penalty
                                  is the score taken away when a gap is
                                  created. The best value depends on the
                                  choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -gapextend          float      [0.5 for any sequence] The gap extension,
                                  penalty is added to the standard gap penalty
                                  for each base or residue in the gap. This
                                  is how long gaps are penalized. Usually you
                                  will expect a few long gaps rather than many
                                  short gaps, so the gap extension penalty
                                  should be lower than the gap penalty. An
                                  exception is where one or both sequences are
                                  single reads with possible sequencing
                                  errors in which case you would expect many
                                  single base gaps. You can get this result by
                                  setting the gap open penalty to zero (or
                                  very low) and using the gap extension
                                  penalty to control gap scoring. (Floating
                                  point number from 0.0 to 10.0)
  [-outseq]            seqoutall  [.] Sequence set(s)
                                  filename and optional format (output USA)
  [-redundantoutseq]   seqoutall  [.] Redundant sequences
                                  (optional)

   Additional (Optional) qualifiers:
   -datafile           matrixf    [EBLOSUM62 for protein, EDNAFULL for DNA]
                                  This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.

   Advanced (Unprompted) qualifiers:
   -feature            toggle     Sequence feature information will be
                                  retained if this option is set.

   Associated qualifiers:

   "-sequences" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -scircular1         boolean    Sequence is circular
   -squick1            boolean    Read id and sequence only
   -sformat1           string     Input sequence format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   "-redundantoutseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier Type Description Allowed values Default

Standard (Mandatory) qualifiers

[-sequences]
(Parameter 1) seqset Sequence set filename and optional format, or reference (input USA) Readable set of sequences Required

-mode list This option specifies whether to remove redundancy at a single threshold percentage sequence similarity or remove redundancy outside a range of acceptable threshold percentage similarity. All permutations of pair-wise sequence alignments are calculated for each set of input sequences in turn using the EMBOSS implementation of the Needleman and Wunsch global alignment algorithm. Redundant sequences are removed in one of two modes as follows: (i) If a pair of proteins achieve greater than a threshold percentage sequence similarity (specified by the user) the shortest sequence is discarded. (ii) If a pair of proteins have a percentage sequence similarity that lies outside an acceptable range (specified by the user) the shortest sequence is discarded.
1 (Single threshold percentage sequence similarity)
2 (Outside a range of acceptable threshold percentage similarities)
1

-threshold float This option specifies the percentage sequence identity redundancy threshold. The percentage sequence identity redundancy threshold determines the redundancy calculation. If a pair of proteins achieve greater than this threshold the shortest sequence is discarded. Any numeric value 95.0

-minthreshold float This option specifies the percentage sequence identity redundancy threshold (lower limit). The percentage sequence identity redundancy threshold determines the redundancy calculation. If a pair of proteins have a percentage sequence similarity that lies outside an acceptable range the shortest sequence is discarded. Any numeric value 30.0

-maxthreshold float This option specifies the percentage sequence identity redundancy threshold (upper limit). The percentage sequence identity redundancy threshold determines the redundancy calculation. If a pair of proteins have a percentage sequence similarity that lies outside an acceptable range the shortest sequence is discarded. Any numeric value 90.0

-gapopen float The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. Floating point number from 1.0 to 100.0 10.0 for any sequence

-gapextend float The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. Floating point number from 0.0 to 10.0 0.5 for any sequence

[-outseq]
(Parameter 2) seqoutall Sequence set(s) filename and optional format (output USA) Writeable sequence(s) <*>.format

[-redundantoutseq]
(Parameter 3) seqoutall Redundant sequences (optional) Writeable sequence(s) <*>.format

Additional (Optional) qualifiers

-datafile matrixf This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. Comparison matrix file in EMBOSS data path EBLOSUM62 for protein
EDNAFULL for DNA

Advanced (Unprompted) qualifiers

-feature toggle Sequence feature information will be retained if this option is set. Toggle value Yes/No No

Associated qualifiers

"-sequences" associated seqset qualifiers

-sbegin1
-sbegin_sequences integer Start of each sequence to be used Any integer value 0

-send1
-send_sequences integer End of each sequence to be used Any integer value 0

-sreverse1
-sreverse_sequences boolean Reverse (if DNA) Boolean value Yes/No N

-sask1
-sask_sequences boolean Ask for begin/end/reverse Boolean value Yes/No N

-snucleotide1
-snucleotide_sequences boolean Sequence is nucleotide Boolean value Yes/No N

-sprotein1
-sprotein_sequences boolean Sequence is protein Boolean value Yes/No N

-slower1
-slower_sequences boolean Make lower case Boolean value Yes/No N

-supper1
-supper_sequences boolean Make upper case Boolean value Yes/No N

-scircular1
-scircular_sequences boolean Sequence is circular Boolean value Yes/No N

-squick1
-squick_sequences boolean Read id and sequence only Boolean value Yes/No N

-sformat1
-sformat_sequences string Input sequence format Any string

-iquery1
-iquery_sequences string Input query fields or ID list Any string

-ioffset1
-ioffset_sequences integer Input start position offset Any integer value 0

-sdbname1
-sdbname_sequences string Database name Any string

-sid1
-sid_sequences string Entryname Any string

-ufo1
-ufo_sequences string UFO features Any string

-fformat1
-fformat_sequences string Features format Any string

-fopenfile1
-fopenfile_sequences string Features file name Any string

"-outseq" associated seqoutall qualifiers

-osformat2
-osformat_outseq string Output seq format Any string

-osextension2
-osextension_outseq string File name extension Any string

-osname2
-osname_outseq string Base file name Any string

-osdirectory2
-osdirectory_outseq string Output directory Any string

-osdbname2
-osdbname_outseq string Database name to add Any string

-ossingle2
-ossingle_outseq boolean Separate file for each entry Boolean value Yes/No N

-oufo2
-oufo_outseq string UFO features Any string

-offormat2
-offormat_outseq string Features format Any string

-ofname2
-ofname_outseq string Features file name Any string

-ofdirectory2
-ofdirectory_outseq string Output directory Any string

"-redundantoutseq" associated seqoutall qualifiers

-osformat3
-osformat_redundantoutseq string Output seq format Any string

-osextension3
-osextension_redundantoutseq string File name extension Any string

-osname3
-osname_redundantoutseq string Base file name Any string

-osdirectory3
-osdirectory_redundantoutseq string Output directory Any string

-osdbname3
-osdbname_redundantoutseq string Database name to add Any string

-ossingle3
-ossingle_redundantoutseq boolean Separate file for each entry Boolean value Yes/No N

-oufo3
-oufo_redundantoutseq string UFO features Any string

-offormat3
-offormat_redundantoutseq string Features format Any string

-ofname3
-ofname_redundantoutseq string Features file name Any string

-ofdirectory3
-ofdirectory_redundantoutseq string Output directory Any string

General qualifiers

-auto boolean Turn off prompts Boolean value Yes/No N

-stdout boolean Write first file to standard output Boolean value Yes/No N

-filter boolean Read first file from standard input, write first file to standard output Boolean value Yes/No N

-options boolean Prompt for standard and additional values Boolean value Yes/No N

-debug boolean Write debug output to program.dbg Boolean value Yes/No N

-verbose boolean Report some/full command line options Boolean value Yes/No Y

-help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose Boolean value Yes/No N

-warning boolean Report warnings Boolean value Yes/No Y

-error boolean Report errors Boolean value Yes/No Y

-fatal boolean Report fatal errors Boolean value Yes/No Y

-die boolean Report dying program messages Boolean value Yes/No Y

-version boolean Report version number and exit Boolean value Yes/No N

Input file format

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl

Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.

The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Input files for usage example

File: globins.fasta

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

Output file format

The output is a standard EMBOSS sequence file.

The results can be output in one of several styles by using the command-line qualifier -osformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, dasgff, debug, listfile, dbmotif, diffseq, excel, feattable, motif, nametable, regions, seqtable, simple, srs, table, tagseq.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Output files for usage example

File: globins.keep

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

File: globins.redundant

>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR

Data files

For protein sequences EBLOSUM62 is used for the substitution matrix. For nucleotide sequence, EDNAFULL is used. Others can be specified.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

Program name	Description
aligncopy	Read and write alignments
aligncopypair	Read and write pairs from alignments
biosed	Replace or delete sequence sections
codcopy	Copy and reformat a codon usage table
cutseq	Remove a section from a sequence
degapseq	Remove non-alphabetic (e.g. gap) characters from sequences
descseq	Alter the name or description of a sequence
entret	Retrieve sequence entries from flatfile databases and files
extractalign	Extract regions from a sequence alignment
extractfeat	Extract features from sequence(s)
extractseq	Extract regions from a sequence
featcopy	Read and write a feature table
featmerge	Merge two overlapping feature tables
featreport	Read and write a feature table
feattext	Return a feature table original text
listor	Write a list file of the logical OR of two sets of sequences
makenucseq	Create random nucleotide sequences
makeprotseq	Create random protein sequences
maskambignuc	Mask all ambiguity characters in nucleotide sequences with N
maskambigprot	Mask all ambiguity characters in protein sequences with X
maskfeat	Write a sequence with masked features
maskseq	Write a sequence with masked regions
newseq	Create a sequence file from a typed-in sequence
nohtml	Remove mark-up (e.g. HTML tags) from an ASCII text file
noreturn	Remove carriage return from ASCII files
nospace	Remove whitespace from an ASCII text file
notab	Replace tabs with spaces in an ASCII text file
notseq	Write to file a subset of an input stream of sequences
nthseq	Write to file a single sequence from an input stream of sequences
nthseqset	Read and write (return) one set of sequences from many
pasteseq	Insert one sequence into another
revseq	Reverse and complement a nucleotide sequence
seqcount	Read and count sequences
seqret	Read and write (return) sequences
seqretsetall	Read and write (return) many sets of sequences
seqretsplit	Read sequences and write them to individual files
sizeseq	Sort sequences by size
skipseq	Read and write (return) sequences, skipping first few
splitsource	Split sequence(s) into original source sequences
splitter	Split sequence(s) into smaller sequences
trimest	Remove poly-A tails from nucleotide sequences
trimseq	Remove unwanted characters from start and end of sequence(s)
trimspace	Remove extra whitespace from an ASCII text file
union	Concatenate multiple sequences into a single sequence
vectorstrip	Remove vectors from the ends of nucleotide sequence(s)
yank	Add a sequence reference (a full USA) to a list file

Author(s)

Jon Ison
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None

Wiki

Function

Description

Algorithm

Usage

Command line arguments

Input file format

Input files for usage example

File: globins.fasta

Output file format

Output files for usage example

File: globins.keep

File: globins.redundant

Data files

Notes

References

Warnings

Diagnostic Error Messages

Exit status

Known bugs

See also

Author(s)

History

Target users

Comments