embword

Data type EmbPWordMatch

NUCLEUS data structure for word matches

Attributes

Name	Type	Description
seq1start	ajuint	match start point in original sequence
seq2start	ajuint	match start point in comparison sequence
sequence	const AjPSeq	Sequence the match belongs to; needed when matches are built for multiple sequences
length	ajint	length of match
Padding	char[4]	Padding to alignment boundary
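
As an illustration of how the fields of a match record fit together, here is a minimal sketch in plain C. It is not the EMBOSS definition: SketchWordMatch and sketch_match_is_valid are hypothetical names, unsigned int and int stand in for ajuint and ajint, a C string stands in for const AjPSeq, and the sequence field is assumed to hold the comparison sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for EmbPWordMatch using plain C types. */
typedef struct SketchWordMatch {
    unsigned int seq1start;  /* match start point in original sequence */
    unsigned int seq2start;  /* match start point in comparison sequence */
    const char *sequence;    /* assumption: text of the comparison sequence */
    int length;              /* length of match */
} SketchWordMatch;

/* Return 1 if the recorded region is identical in both sequences, else 0. */
int sketch_match_is_valid(const SketchWordMatch *m, const char *seq1)
{
    size_t len1 = strlen(seq1);
    size_t len2 = strlen(m->sequence);
    size_t n;

    if (m->length < 0)
        return 0;
    n = (size_t)m->length;

    /* Reject matches that would run past the end of either sequence. */
    if (m->seq1start > len1 || n > len1 - m->seq1start)
        return 0;
    if (m->seq2start > len2 || n > len2 - m->seq2start)
        return 0;

    return memcmp(seq1 + m->seq1start, m->sequence + m->seq2start, n) == 0;
}
```

The explicit padding member is omitted in the sketch because the compiler inserts any alignment padding it needs.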

Data type EmbPWord

NUCLEUS data structure for words

Attributes

Name	Type	Description
fword	const char*	Original word
seqlocs	AjPTable	Table of word start positions in multiple sequences
count	ajint	Total number of locations in all sequences
Padding	char[4]	Padding to alignment boundary

Data type EmbPWordSeqLocs

NUCLEUS data structure for word locations in a given sequence

Attributes

Name	Type	Description
seq	const AjPSeq	Sequence for word start positions
locs	AjPList	List of word start positions in the sequence
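
The relationship between a word and its per-sequence location lists can be sketched in plain C as follows. This is only an illustration under stated assumptions: SketchSeqLocs, sketch_collect_locs and sketch_total_count are hypothetical names, a fixed-size array replaces AjPList, and a C string replaces const AjPSeq. The total returned by sketch_total_count plays the role of the count attribute of EmbPWord.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_LOCS 64

/* Hypothetical stand-in for EmbPWordSeqLocs. */
typedef struct SketchSeqLocs {
    const char *seq;                     /* sequence for word start positions */
    unsigned int locs[SKETCH_MAX_LOCS];  /* word start positions in seq */
    size_t nlocs;                        /* number of positions stored */
} SketchSeqLocs;

/* Record every start position of word in seq (up to SKETCH_MAX_LOCS). */
void sketch_collect_locs(SketchSeqLocs *out, const char *seq, const char *word)
{
    size_t wlen = strlen(word);
    size_t slen = strlen(seq);
    size_t i;

    out->seq = seq;
    out->nlocs = 0;
    if (wlen == 0 || wlen > slen)
        return;

    for (i = 0; i + wlen <= slen && out->nlocs < SKETCH_MAX_LOCS; i++) {
        if (memcmp(seq + i, word, wlen) == 0)
            out->locs[out->nlocs++] = (unsigned int)i;
    }
}

/* Total number of locations over all sequences. */
size_t sketch_total_count(const SketchSeqLocs *all, size_t nseqs)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < nseqs; i++)
        total += all[i].nlocs;
    return total;
}
```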

Data type EmbPWordRK

Data structure that extends EmbPWord objects for efficient access during Rabin-Karp search. It is constructed by the embWordInitRabinKarpSearch function for a given sequence set.

A possible improvement would be to scan all other words during preprocessing to find the minimum length that can safely be skipped when a word is matched.

The fields seqindxs through nseqs are set during initialisation; the last three fields (nMatches, lenMatches, nSeqMatches) are calculated during search.

Attributes

Name	Type	Description
word	const EmbPWord	Original word object
seqindxs	ajuint*	Positions in the seqset of each sequence in which the word has been seen
nnseqlocs	ajuint*	Number of word start positions for each sequence
locs	ajuint**	List of word start positions for each sequence
hash	ajulong	Hash value for the word
nseqs	ajuint	Number of pattern sequences in which the word has been seen
nMatches	ajuint	Number of matches in query sequences
lenMatches	ajulong	Total length of extended matches
nSeqMatches	ajuint*	Number of matches recorded per pattern sequence
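
To show why a precomputed hash value per word is useful, here is a minimal sketch of the general Rabin-Karp technique in plain C. It is not the EMBOSS implementation: sketch_hash and sketch_rk_count are hypothetical names, the base 257 is an arbitrary choice, and the hash relies on unsigned 64-bit wraparound rather than an explicit modulus. The word is hashed once, and the hash of each window of the query is updated in constant time as the window slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BASE 257ULL  /* assumption: arbitrary polynomial base */

typedef unsigned long long sketch_hash_t;

/* Polynomial hash of the first len characters of s (wraps modulo 2^64). */
sketch_hash_t sketch_hash(const char *s, size_t len)
{
    sketch_hash_t h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = h * SKETCH_BASE + (unsigned char)s[i];
    return h;
}

/* Count occurrences of word in text using a rolling hash. */
size_t sketch_rk_count(const char *text, const char *word)
{
    size_t wlen = strlen(word);
    size_t tlen = strlen(text);
    sketch_hash_t target, window, high = 1;
    size_t i, n = 0;

    if (wlen == 0 || wlen > tlen)
        return 0;

    target = sketch_hash(word, wlen);   /* computed once per word */
    window = sketch_hash(text, wlen);   /* hash of the first window */
    for (i = 1; i < wlen; i++)
        high *= SKETCH_BASE;            /* BASE^(wlen-1) */

    for (i = 0; ; i++) {
        /* Confirm a hash hit with a direct comparison to rule out collisions. */
        if (window == target && memcmp(text + i, word, wlen) == 0)
            n++;
        if (i + wlen >= tlen)
            break;
        /* Slide the window: drop text[i], append text[i + wlen]. */
        window -= (unsigned char)text[i] * high;
        window = window * SKETCH_BASE + (unsigned char)text[i + wlen];
    }
    return n;
}
```

In this sketch the returned count corresponds loosely to a per-word match counter such as nMatches; extending matches and per-sequence bookkeeping are left out.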