TDParser internals

Beyond the Reference section, here is an in-depth description of tdparser’s internals.

Lexer helpers

The tdparser.lexer module holds the Lexer class, which is also available in the top-level tdparser module as tdparser.Lexer.

class tdparser.lexer.TokenRegistry

This class holds a set of (token, regexp) pairs, and selects the appropriate pair to extract data from a string.


Note: The TokenRegistry doesn’t interact with the Token subclasses provided through register(). Any kind of value can therefore be passed as the token, and the get_token() method will return it as-is.


_tokens

Holds a list of (Token, re.RegexObject) tuples, in the order they were registered. Insertion order matters.

Type: list of (Token subclass, re.RegexObject) tuples

register(self, token, regexp)

Register a Token subclass for the given regexp.

Parameters:
  • token (tdparser.Token) – The Token subclass to register
  • regexp (str) – The regular expression (as a string) associated with the token

matching_tokens(self, text[, start=0])

Retrieve all tokens matching a given text. The optional start argument can be used to alter the start position for the match() call.

Parameters:
  • text (str) – Text for which matching (Token, re.MatchObject) pairs should be searched
  • start (int) – Optional start position within text for the regexp match() call

Yields tuples of (Token, re.MatchObject) for each token whose regexp matched the text.
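
The behaviour described above can be approximated with a short generator. This is a minimal sketch under stated assumptions, not tdparser's actual code: the free function `matching_tokens_sketch` and the plain-string token names are hypothetical, and the registry is modelled as a bare list of (token, compiled regexp) pairs.

```python
import re


def matching_tokens_sketch(pairs, text, start=0):
    """Yield (token, match) for every pattern matching text at position start."""
    for token, pattern in pairs:
        # A compiled pattern's match() accepts a start position as second argument.
        match = pattern.match(text, start)
        if match is not None:
            yield token, match


# Hypothetical registry contents; plain strings stand in for Token subclasses.
pairs = [
    ("NAME", re.compile(r"[a-z]+")),
    ("INT", re.compile(r"\d+")),
    ("ALNUM", re.compile(r"[a-z0-9]+")),
]

at_start = [(tok, m.group()) for tok, m in matching_tokens_sketch(pairs, "ab12")]
at_two = [(tok, m.group()) for tok, m in matching_tokens_sketch(pairs, "ab12", start=2)]
```

Results come out in registration order, and a pattern that does not match at the requested position is skipped rather than reported.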

get_token(self, text[, start=0])

Retrieve the best token class and the related match at the start of the given text.

The algorithm for choosing the “best” class is:

  • Fetch all matching tokens (through matching_tokens())
  • Select those with the longest match
  • Return the first of those tokens

A different starting position for match() calls can be provided in the start parameter.

Parameters:
  • text (str) – Text for which the (Token, re.MatchObject) pair should be returned
  • start (int) – Optional start position within text for the regexp match() call

Returns: (Token, re.MatchObject) pair, the best match for the given text.

If no token matches the text, returns (None, None).
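
The selection rule (longest match, ties going to the earliest registered token) can be sketched as follows. This is an illustration only, not the library's implementation: `get_token_sketch` and the token names are hypothetical, and the registry is again modelled as a list of (token, compiled regexp) pairs.

```python
import re


def get_token_sketch(pairs, text, start=0):
    """Return the (token, match) pair with the longest match, or (None, None)."""
    best_token, best_match = None, None
    for token, pattern in pairs:
        match = pattern.match(text, start)
        if match is None:
            continue
        length = match.end() - match.start()
        # Strictly longer only: on a tie, the token registered first is kept.
        if best_match is None or length > best_match.end() - best_match.start():
            best_token, best_match = token, match
    return best_token, best_match


# Hypothetical registry contents; plain strings stand in for Token subclasses.
pairs = [
    ("ASSIGN", re.compile(r"=")),
    ("EQUALS", re.compile(r"==")),
    ("ALSO_EQUALS", re.compile(r"==")),
]

token, match = get_token_sketch(pairs, "a == b", start=2)
missing = get_token_sketch(pairs, "a == b")  # no pattern matches at position 0
```

Here "EQUALS" beats "ASSIGN" because its match is longer, and beats "ALSO_EQUALS" because it was registered first.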


The len() of a TokenRegistry is the length of its _tokens attribute.
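
To show how registration, insertion order and len() fit together, here is a minimal stand-in class. It is a sketch under stated assumptions: `SketchRegistry` is a hypothetical name, and only the `_tokens` attribute, the `register()` signature and the len() behaviour are taken from the description above.

```python
import re


class SketchRegistry:
    """Illustrative stand-in; not tdparser's actual implementation."""

    def __init__(self):
        # (token, compiled regexp) pairs, kept in insertion order.
        self._tokens = []

    def register(self, token, regexp):
        # The token is stored untouched; only the regexp string is compiled.
        self._tokens.append((token, re.compile(regexp)))

    def __len__(self):
        return len(self._tokens)


registry = SketchRegistry()
registry.register("INT", r"\d+")       # a plain string stands in for a Token subclass
registry.register(("op", "+"), r"\+")  # any other value is accepted as well
count = len(registry)
```

Because the registry never inspects the token value, the string and the tuple are stored and handed back unchanged.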