TDParser internals

This section complements the Reference section with an in-depth description of tdparser's internals.

`Lexer` helpers

This module holds the tdparser.Lexer class, which is available in the top-level tdparser module.

class tdparser.lexer.TokenRegistry

This class holds a set of (token, regexp) pairs, and selects the appropriate pair to extract data from a string.

Note

The TokenRegistry never interacts with the Token subclasses passed to register(); it only stores them.

This means that any kind of value can be passed as the token argument of register(), and it will be returned as-is by the get_token() method.

_tokens

Holds a list of (Token, re.RegexObject) tuples. These are the tokens in the order they were inserted (insertion order matters).

Type:	list of (`Token` subclass, `re.RegexObject`) tuples

register(self, token, regexp)

Register a Token subclass for the given regexp.

Parameters:

token (tdparser.Token) – The `Token` subclass to register
regexp (str) – The regular expression (as a string) associated with the token

matching_tokens(self, text[, start=0])

Retrieve all tokens matching a given text. The optional start argument can be used to alter the start position for the match() call.

Parameters:

text (str) – Text for which matching (`Token`, `re.MatchObject`) pairs should be searched
start (int) – Optional start position within `text` for the regexp `match()` call

Returns:

Yields tuples of (`Token`, `re.MatchObject`) for each token whose regexp matched the `text`.
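
The behaviour described for register() and matching_tokens() can be illustrated with a minimal sketch. `SketchRegistry` is a hypothetical stand-in written for this example, not tdparser's actual code; it only assumes that regexps are compiled at registration time and that each pattern is tried with `match()` at the `start` position. Plain strings are used as tokens, since the registry returns whatever was registered as-is.

```python
import re


class SketchRegistry:
    """Hypothetical stand-in for TokenRegistry; not tdparser's actual code."""

    def __init__(self):
        # (token, compiled regexp) pairs, kept in insertion order.
        self._tokens = []

    def register(self, token, regexp):
        # The token is stored untouched; only the regexp string is compiled.
        self._tokens.append((token, re.compile(regexp)))

    def matching_tokens(self, text, start=0):
        # Yield every (token, match) pair whose regexp matches at `start`.
        for token, pattern in self._tokens:
            match = pattern.match(text, start)
            if match is not None:
                yield token, match

    def __len__(self):
        return len(self._tokens)


registry = SketchRegistry()
registry.register("INT", r"\d+")
registry.register("WORD", r"[a-z]+")
registry.register("ALNUM", r"[a-z0-9]+")

# Only the patterns matching at position 2 ("ab") are yielded,
# in registration order.
matches = [(token, m.group()) for token, m in registry.matching_tokens("12ab", start=2)]
print(matches)  # [('WORD', 'ab'), ('ALNUM', 'ab')]
```

Because matching_tokens() is a generator, callers can stop iterating as soon as they have found a suitable pair.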

get_token(self, text[, start=0])

Retrieve the best token class and the related match at the start of the given text.

The algorithm for choosing the “best” class is:

A different starting position for match() calls can be provided in the start parameter.

Parameters:

text (str) – Text for which the (Token, re.MatchObject) pair should be returned
start (int) – Optional start position within text for the regexp match() call

Returns:

(Token, re.MatchObject) pair, the best match for the given text.

If no token matches the text, returns (None, None).
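
The selection steps themselves are not listed in this section, so the following is only a sketch under an explicit assumption: the longest match wins, and ties go to the pair that was registered first. `get_token_sketch` and the token names are hypothetical; only the (None, None) result for unmatched text is taken from the description above.

```python
import re


def get_token_sketch(pairs, text, start=0):
    """Pick one (token, match) pair from `pairs`, a list of (token, compiled regexp).

    ASSUMPTION (not confirmed by the text above): the longest match wins,
    and ties are resolved in favour of the earliest registered pair.
    """
    best_token = best_match = None
    for token, pattern in pairs:
        match = pattern.match(text, start)
        if match is None:
            continue
        # The strict comparison keeps the earlier pair when two matches
        # end at the same position.
        if best_match is None or match.end() > best_match.end():
            best_token, best_match = token, match
    return best_token, best_match


pairs = [
    ("EQ", re.compile(r"=")),
    ("EQEQ", re.compile(r"==")),
    ("OP", re.compile(r"[=!]=")),
]

token, match = get_token_sketch(pairs, "== 1")
print(token, match.group())  # EQEQ ==

# No pattern matches at the start of "abc".
no_token, no_match = get_token_sketch(pairs, "abc")
print(no_token, no_match)  # None None
```

Under this assumed rule, registration order only matters between patterns that match the same number of characters, as with `EQEQ` and `OP` here.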

__len__(self)

The len() of a TokenRegistry is the length of its _tokens attribute.