TDParser internals

Beyond the Reference section, here is an in-depth description of tdparser’s internals.

Lexer helpers

This module holds the tdparser.Lexer class, which is also exposed in the top-level tdparser module.

class tdparser.lexer.TokenRegistry

This class holds a set of (token, regexp) pairs, and selects the appropriate pair to extract data from a string.

Note

The TokenRegistry doesn’t interact with the Token subclasses provided through register().

This means that any kind of value can be passed as the token argument of register(); it will be returned as-is by the get_token() method.

_tokens

Holds a list of (Token, re.RegexObject) tuples, in the order they were registered. Insertion order matters: when several tokens produce equally long matches, get_token() returns the one registered first.

Type: list of (Token subclass, re.RegexObject) tuples
register(self, token, regexp)

Register a Token subclass for the given regexp.

Parameters:
  • token (tdparser.Token) – The Token subclass to register
  • regexp (str) – The regular expression (as a string) associated with the token
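
As an illustration of the storage described above, here is a minimal sketch of a registry that keeps (token, compiled regexp) pairs in insertion order. SketchRegistry is a hypothetical stand-in written for this example, not the library's code; compiling the regexp inside register() is an assumption based on _tokens being documented as holding compiled patterns.

```python
import re


class SketchRegistry:
    """Simplified stand-in for TokenRegistry (hypothetical, for illustration)."""

    def __init__(self):
        # (token, compiled regexp) pairs, kept in insertion order.
        self._tokens = []

    def register(self, token, regexp):
        # The token is stored untouched; only the regexp string is compiled.
        self._tokens.append((token, re.compile(regexp)))

    def __len__(self):
        # The length of the registry is the number of registered pairs.
        return len(self._tokens)


registry = SketchRegistry()
# Plain strings stand in for Token subclasses: the registry never inspects them.
registry.register("INT", r"\d+")
registry.register("PLUS", r"\+")
```

Because the token value is never inspected, plain strings work here just as well as Token subclasses would.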
matching_tokens(self, text[, start=0])

Retrieve all tokens matching a given text. The optional start argument can be used to alter the start position for the match() call.

Parameters:
  • text (str) – Text for which matching (Token, re.MatchObject) pairs should be searched
  • start (int) – Optional start position within text for the regexp match() call
Returns:

Yields tuples of (Token, re.MatchObject) for each token whose regexp matched the text.
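
The behaviour described above could be sketched as a generator over the registered pairs. This is a simplified illustration under the assumption that each regexp is tried with match() at the start position; the free function and the sample token names are hypothetical and not part of tdparser.

```python
import re


def matching_tokens(tokens, text, start=0):
    """Yield (token, match) for every registered regexp matching at `start`."""
    for token, regexp in tokens:
        # Pattern.match() anchors the match at position `start` of `text`.
        match = regexp.match(text, start)
        if match is not None:
            yield token, match


# Hypothetical registry content: strings stand in for Token subclasses.
tokens = [
    ("INT", re.compile(r"\d+")),
    ("FLOAT", re.compile(r"\d+\.\d+")),
    ("PLUS", re.compile(r"\+")),
]

at_start = [(tok, m.group()) for tok, m in matching_tokens(tokens, "3.14+2")]
at_four = [(tok, m.group()) for tok, m in matching_tokens(tokens, "3.14+2", 4)]
```

With this input, both "INT" and "FLOAT" match at position 0 (on "3" and "3.14" respectively), while only "PLUS" matches at position 4.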

get_token(self, text[, start=0])

Retrieve the best token class and the related match at the start of the given text.

The algorithm for choosing the “best” class is:

  • Fetch all matching tokens (through matching_tokens())
  • Select those with the longest match
  • Return the first of those tokens

A different starting position for match() calls can be provided in the start parameter.

Parameters:
  • text (str) – Text for which the (Token, re.MatchObject) pair should be returned
  • start (int) – Optional start position within text for the regexp match() call
Returns:

(Token, re.MatchObject) pair, the best match for the given text.

If no token matches the text, returns (None, None).
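
The selection rule described above (longest match wins, ties go to the token registered first) can be sketched as follows. This is an illustrative re-implementation under those stated rules, not the library's source; the free function and the token names are hypothetical.

```python
import re


def get_token(tokens, text, start=0):
    """Return the (token, match) pair with the longest match at `start`."""
    best_token, best_match = None, None
    for token, regexp in tokens:
        match = regexp.match(text, start)
        if match is None:
            continue
        # Strict comparison: on equal lengths, the earlier registration is kept.
        if best_match is None or match.end() > best_match.end():
            best_token, best_match = token, match
    return best_token, best_match


# Hypothetical registry content: "NAME" is registered before "IF".
tokens = [
    ("NAME", re.compile(r"[a-z]+")),
    ("IF", re.compile(r"if")),
    ("INT", re.compile(r"\d+")),
]

tie_token, tie_match = get_token(tokens, "if")       # both NAME and IF match "if"
int_token, int_match = get_token(tokens, "x1", 1)    # matching starts at index 1
no_token, no_match = get_token(tokens, "?")          # nothing matches
```

In the tie case, "NAME" is returned because it was registered before "IF"; registering more specific tokens first is therefore the way to give them priority over equally long generic matches.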

__len__(self)

The len() of a TokenRegistry is the length of its _tokens attribute.