embword


Data type EmbPWordMatch

NUCLEUS data structure for word matches

Attributes

Name        Type            Description
seq1start   ajuint          Match start point in original sequence
seq2start   ajuint          Match start point in comparison sequence
sequence    const AjPSeq    Needed when multiple matches are built here, so we know which sequence the match belongs to
length      ajint           Length of match
Padding     char[4]         Padding to alignment boundary
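
As an illustration of how these attributes fit together, the following C sketch mirrors the attribute list with stand-in types. The Sketch-prefixed names, the use of unsigned int and int in place of ajuint and ajint, and the simplified SketchSeq type are assumptions made for this example, not the NUCLEUS definitions. It shows how the position just past the end of a match in each sequence would follow from the stored start point and length, in whatever coordinate base the start points use.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for AjPSeq (assumption); the real type is an AJAX sequence object. */
typedef struct SketchSeq {
    const char *name;
} SketchSeq;

/* Hypothetical mirror of the EmbPWordMatch attributes (padding omitted). */
typedef struct SketchWordMatch {
    unsigned int seq1start;     /* match start point in original sequence */
    unsigned int seq2start;     /* match start point in comparison sequence */
    const SketchSeq *sequence;  /* sequence the match belongs to */
    int length;                 /* length of match */
} SketchWordMatch;

/* Position one past the last matched residue in the original sequence. */
unsigned int sketch_match_end1(const SketchWordMatch *m)
{
    assert(m != NULL && m->length >= 0);
    return m->seq1start + (unsigned int) m->length;
}

/* Position one past the last matched residue in the comparison sequence. */
unsigned int sketch_match_end2(const SketchWordMatch *m)
{
    assert(m != NULL && m->length >= 0);
    return m->seq2start + (unsigned int) m->length;
}
```

Keeping the sequence pointer in each match is what allows a single collection to hold matches against several comparison sequences.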


Data type EmbPWord

NUCLEUS data structure for words

Attributes

Name        Type            Description
fword       const char*     Original word
seqlocs     AjPTable        Table of word start positions in multiple sequences
count       ajint           Total number of locations in all sequences
Padding     char[4]         Padding to alignment boundary


Data type EmbPWordSeqLocs

NUCLEUS data structure for word locations in a given sequence

Attributes

Name        Type            Description
seq         const AjPSeq    Sequence for word start positions
locs        AjPList         List of word start positions in the sequence
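
The relationship between EmbPWord and EmbPWordSeqLocs can be sketched in plain C as follows. This is a simplified model, not the library code: plain arrays stand in for AjPTable and AjPList, a name string stands in for AjPSeq, and every Sketch-prefixed identifier is hypothetical. It shows how the count attribute of a word would equal the sum of the per-sequence location lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for EmbPWordSeqLocs; an array replaces AjPList. */
typedef struct SketchSeqLocs {
    const char *seqname;        /* stands in for the AjPSeq attribute seq */
    const unsigned int *locs;   /* word start positions in this sequence */
    size_t nlocs;               /* number of entries in locs */
} SketchSeqLocs;

/* Hypothetical stand-in for EmbPWord; an array replaces AjPTable. */
typedef struct SketchWord {
    const char *fword;              /* original word */
    const SketchSeqLocs *seqlocs;   /* one entry per sequence */
    size_t nseqs;                   /* number of entries in seqlocs */
    int count;                      /* total locations in all sequences */
} SketchWord;

/* Recompute the total number of locations across all sequences. */
int sketch_word_count(const SketchWord *w)
{
    size_t i;
    int total = 0;

    assert(w != NULL);
    for (i = 0; i < w->nseqs; i++)
        total += (int) w->seqlocs[i].nlocs;
    return total;
}
```

For example, a word seen at two positions in one sequence and one position in another would have a count of 3.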


Data type EmbPWordRK

Data structure that extends EmbPWord objects for efficient access during Rabin-Karp search. It is constructed by the embWordInitRabinKarpSearch function for a given sequence set.

A possible improvement would be to scan all other words during preprocessing to find the minimum length that can safely be skipped once a word is matched.

The five fields from seqindxs to nseqs are set during initialisation, and the last three fields (nMatches, lenMatches, nSeqMatches) are calculated during search.

Attributes

Name         Type             Description
word         const EmbPWord   Original word object
seqindxs     ajuint*          Positions in the seqset for each sequence the word has been seen in
nnseqlocs    ajuint*          Number of word start positions for each sequence
locs         ajuint**         List of word start positions for each sequence
hash         ajulong          Hash value for the word
nseqs        ajuint           Number of pattern sequences the word has been seen in
nMatches     ajuint           Number of matches in query sequences
lenMatches   ajulong          Total length of extended matches
nSeqMatches  ajuint*          Number of matches recorded on a per-pattern-sequence basis
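
Since this structure stores one hash value per word for Rabin-Karp search, a generic illustration of that technique may help. The sketch below is not the hashing scheme used by embWordInitRabinKarpSearch, whose details are not given here. The hash base, the reliance on unsigned 64-bit wraparound as the modulus, and all sketch_ function names are assumptions. It hashes a word once, rolls a hash of the same width across a query string, and counts confirmed occurrences, which is loosely analogous to what nMatches records.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BASE 257ULL  /* assumed hash base; arithmetic wraps mod 2^64 */

/* Polynomial hash of the first len characters of s. */
unsigned long long sketch_hash(const char *s, size_t len)
{
    unsigned long long h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = h * SKETCH_BASE + (unsigned char) s[i];
    return h;
}

/* Count occurrences of word in query using a rolling hash.
   Hash hits are confirmed with memcmp to rule out collisions. */
unsigned int sketch_rk_count(const char *query, const char *word)
{
    size_t n, m, i;
    unsigned long long target, h, top = 1;
    unsigned int nmatches = 0;

    assert(query != NULL && word != NULL);
    n = strlen(query);
    m = strlen(word);
    if (m == 0 || m > n)
        return 0;

    for (i = 1; i < m; i++)
        top *= SKETCH_BASE;         /* BASE^(m-1): weight of leading char */

    target = sketch_hash(word, m);  /* computed once, like the hash field */
    h = sketch_hash(query, m);      /* hash of the first window */

    for (i = 0; ; i++) {
        if (h == target && memcmp(query + i, word, m) == 0)
            nmatches++;
        if (i + m >= n)
            break;
        /* Drop query[i], shift the window, append query[i + m]. */
        h = (h - (unsigned char) query[i] * top) * SKETCH_BASE
            + (unsigned char) query[i + m];
    }
    return nmatches;
}
```

Precomputing the word hash once is what makes each window comparison a constant-time integer test. The character-by-character check runs only on a hash hit.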