Report Formats


When an EMBOSS program produces a report of an analysis, the resulting output is written to a file.

There are many different programs in EMBOSS that create many different types of reports. Some of these programs have been incorporated into EMBOSS from pre-existing programs and some were specially written for it.

The resulting assortment of programs was starting to produce report output in a variety of different formats. It was therefore decided that from EMBOSS version 2.2 onwards there should be a set of standard report formats.


Why have standard formats?

Standardising on a set of formats enables programs to be written that can read in the results from many different programs.

Even if you only intend to read the reports yourself and not pass them to other programs, a standard set of formats is still worthwhile: you quickly get used to the look and feel of each format and can compare the reports from different programs more easily.

It is often convenient to have the same program produce different report formats for different purposes. Depending on what you wish to do with the result, you may want a human-readable report for publication, or a less readable report for input into another program for further analysis.

Different programs will have different default report formats. You may accept the default or choose your preferred format when you run the program.


What are the formats?

Head and tail of the format

The majority of the report formats (all except those that are also standard sequence feature tables or other defined formats, such as embl, genbank, gff, pir, swiss, excel and feattable) have a block of information at the start of the report describing the program, date, output filename, ID name of the sequence, and some of the parameters and statistics of the report.

For example:

########################################
# Program: garnier
# Rundate: Mon Feb 11 15:14:40 2002
# Report_file: report.dbmotif
########################################

#=======================================
#
# Sequence: 100K_RAT     from: 1   to: 889
# HitCount: 206
#
# DCH = 0, DCS = 0
# 
#  Please cite:
#  Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
# 
# 
#
#=======================================
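
The header block shown above can be read by a program with very little code. The following Python sketch is not part of EMBOSS; it assumes that header lines start with '#', that rule lines consist only of '#', '=' or '-', and that a field is a single word followed by ':'. The function name parse_report_header and the shortened SAMPLE text are illustrative only.

```python
def parse_report_header(text):
    """Collect 'Key: value' pairs from the leading '#' comment lines."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            if line.strip():
                break      # first non-blank, non-comment line ends the header
            continue       # blank line between the two header blocks
        body = line.lstrip("#").strip()
        # Skip empty comment lines and rule lines such as '#=====' or '#-----'.
        if not body or set(body) <= set("=-"):
            continue
        key, sep, value = body.partition(":")
        # Assumption: a field name is a single word, so free text such as
        # the 'Please cite:' line is ignored.
        if sep and " " not in key:
            fields[key] = value.strip()
    return fields


# Shortened copy of the example header from this page.
SAMPLE = """\
########################################
# Program: garnier
# Rundate: Mon Feb 11 15:14:40 2002
# Report_file: report.dbmotif
########################################

#=======================================
#
# Sequence: 100K_RAT     from: 1   to: 889
# HitCount: 206
#
#=======================================
"""

fields = parse_report_header(SAMPLE)
print(fields["Program"], fields["HitCount"])
```

The value of Sequence is kept as one string, including its 'from:' and 'to:' parts; splitting it further would be a separate step.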

There is also a block of information at the end of the report for summary information.

For example:

 
#---------------------------------------
#
#  Residue totals: H:364   E:149   T:191   C:185
#         percent: H: 41.7 E: 17.1 T: 21.9 C: 21.2
# 
#
#---------------------------------------

Format names

The formats have been given names that correspond to the names of existing report styles or programs.

Some of the report formats can cope with an unlimited number of sequences, while others are only for single sequences or pairs of sequences.

Example report formats

The following are examples of garnier analysing sw:100K_rat output in various report formats.

Name  Comments
embl Writes a report in EMBL feature table format
genbank Writes a report in Genbank feature table format
gff Writes a report in GFF feature table format
pir Writes a report in PIR feature table format
swiss Writes a report in SwissProt feature table format
debug This is of use only for debugging.
listfile This writes out a list file with the start and end points of the motifs given by '[start:end]' after the sequence's full USA. This is useful as it is a true List File that can be read in by other EMBOSS programs using '@' or 'list::' before the filename.
dbmotif Writes a report in DbMotif format

Format:
  Length = [length]
  Start = position [start] of sequence
  End = position [end] of sequence
  ... other tags ... 
  [sequence]
  [start and end numbered below sequence with '|' marks]
  Blank line

Data reported: Length, Start, End, Sequence (5 bases around feature)
diffseq This format is most useful for reporting results for two aligned sequences, as in the program diffseq.

The report describes matches, usually short, between two sequences and features which overlap them.

Format:
  [Sequence 1 Name] [start]-[end] Length: [length]
  Feature: first sequence feature(s)
  Sequence: motif in sequence 1
  Sequence: motif in sequence 2
  Feature: second sequence feature(s)
  [Sequence 2 Name] [start]-[end] Length: [length]
  Blank line
excel This is a TAB-delimited table format suitable for reading into spread-sheet programs such as Excel.

Name, start, end and score are always reported. Other tags in the report definition are added as extra columns. All values are (for now) unquoted. Missing values are reported as '.'

feattable Writes a report in FeatTable format. The report is an EMBL feature table using only the tags in the report definition. There is no requirement for tag names to match standards for the EMBL feature table.

The original EMBOSS application for this format was cpgreport.

Format:
  FT [type] [start]..[end]
  FT /[tagname]=[tagvalue]
  Blank line

Data reported: Type, Start, End
motif Writes a report in Motif format. Based on the original output format of antigenic, helixturnhelix and sigcleave.

Format:
  (1) Score [score] length [length] at [name] [start]->[end]
              *  (marked at position pos)
            [sequence]
            |        |
      [start]        [end]
  [tagname]: tagvalue

Data reported: Name, Start, End, Length, Score, Sequence

regions Writes a report in Regions format. The report (unusually for the current report formats) includes the feature type.

Format: [type] from [start] to [end] ([length] [name]) ([tagname]: [tagvalue], [tagname]: [tagvalue] ...)

Data reported: Type, Start, End, Length, Name

seqtable Writes a report in SeqTable format. This is a simple table format that includes the feature sequence. See table for a version without the sequence. Missing tag values are reported as '.', and the column width is 6, or longer if the name is longer.

Format:
  Start End [tagnames] Sequence
  [start] [end] [tagvalues] [sequence]
simple Writes a report in SRS simple format. This is a simple parsable format that does not include the feature sequence (see also the srs format), for applications where features can be large. Missing tag values are reported as '.'

 Format:
   Feature [number]
   Name: [ID name]   
   Start:  [start]
   End: [end]
   Length: [length]
   [tagnames:]  [tag values]
   Blank line
srs Writes a report in SRS format. This is a simple parsable format that includes the feature sequence. Missing tag values are reported as '.'

Format:
   Feature [number]
   Name: [ID name]   
   Start:  [start]
   End: [end]
   Length: [length]
   Sequence: [sequence]
   Score: [score]
   [tagnames:]  [tag values]
   Blank line
table Writes a report in Table format. See seqtable for a version with the sequence. Missing tag values are reported as '.', and the column width is 6, or longer if the name is longer.

Format:
  USA Start End Score [tagnames]
  [name] [start] [end] [score] [tagvalues]
tagseq Writes a report in Tagseq format. Features are marked up below the sequence. Originally developed for the garnier application, but has general uses.

Format:
  Sequence position written every 10 bases/residues
  Sequence (50 residues)
  tagname        ++++++++++++    +++++++++
  Blank line

If the tag value is a one-letter code, it is used instead of '+'.
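
As an illustration of reading one of the parsable formats into another program, here is a minimal Python sketch for the record layout described above for simple and srs: a 'Feature' line, then 'key: value' lines, then a blank line. It is not an EMBOSS tool. The function name parse_features, the tag name 'helix' and all sample values are made up, and real reports may differ in detail from this layout.

```python
def parse_features(text):
    """Split a 'simple' or 'srs' style report into one dict per feature."""
    features = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue                      # header and tail comment blocks
        if not stripped:
            # A blank line closes the current feature record.
            if current is not None:
                features.append(current)
                current = None
            continue
        # Accept both 'Key: value' and 'Feature 1' style lines.
        if ":" in stripped:
            key, _, value = stripped.partition(":")
        else:
            key, _, value = stripped.partition(" ")
        key, value = key.strip(), value.strip()
        if key == "Feature":
            if current is not None:       # records not separated by a blank
                features.append(current)
            current = {}
        if current is None:
            continue                      # text outside any record
        if key in ("Start", "End", "Length") and value.isdigit():
            value = int(value)
        # '.' marks a missing tag value in these formats.
        current[key] = None if value == "." else value
    if current is not None:
        features.append(current)
    return features


# Made-up sample data in the layout described on this page.
SAMPLE = """\
Feature 1
Name: 100K_RAT
Start: 3
End: 9
Length: 7
helix: .

Feature 2
Name: 100K_RAT
Start: 12
End: 20
Length: 9
helix: H
"""

for feature in parse_features(SAMPLE):
    print(feature["Name"], feature["Start"], feature["End"])
```

Missing values are mapped to None so that later code does not have to compare against the '.' placeholder.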


Changing the format

Each program that writes a report has a default report format defined for it. This is usually table, although a more appropriate format is chosen as the default for some programs.

You are not restricted to these default formats. You can use any format.

You specify the required format by putting the qualifier -rformat followed by the format name on the command line, for example:

garnier -rformat gff
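
When a report-writing program is run from a script, it can help to check the format name before the command is started. The Python sketch below builds the argument list for such a command. The set of names is taken from the table of formats on this page; the function name report_command is hypothetical, and passing the sequence as the first argument is an assumption about how the program is called.

```python
import shlex

# Format names listed in the table of report formats on this page.
REPORT_FORMATS = {
    "embl", "genbank", "gff", "pir", "swiss", "debug", "listfile",
    "dbmotif", "diffseq", "excel", "feattable", "motif", "regions",
    "seqtable", "simple", "srs", "table", "tagseq",
}


def report_command(program, usa, rformat):
    """Return an argument list that asks 'program' for a given report format."""
    if rformat not in REPORT_FORMATS:
        raise ValueError("unknown report format: " + rformat)
    # Assumption: the sequence USA can be given as the first argument.
    return [program, usa, "-rformat", rformat]


cmd = report_command("garnier", "sw:100K_rat", "gff")
print(shlex.join(cmd))
```

On a machine where EMBOSS is installed, the list can be passed to subprocess.run; the program may still prompt for any values that were not given on the command line.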

Command-line qualifiers

There are other command-line qualifiers that change the behaviour of the report output.

  -rformat             string     report format
  -ropenfile           string     report file name
  -rextension          string     file name extension
  -rname               string     base file name
  -raccshow            bool       show accession number in the report
  -rdesshow            bool       show description in the report
  -rusashow            bool       show the full USA in the report
  -rdirectory          string     report file output directory

Of these, -rformat and -rusashow are likely to be the most useful.