Add comment about incorrect T5 Tokenizer size calculation.

This commit is contained in:
Ryan Dick 2024-08-22 16:09:46 +00:00 committed by Brandon
parent d7c22b3bf7
commit 1c1f2c6664


@ -57,6 +57,9 @@ def calc_model_size_by_data(logger: logging.Logger, model: AnyModel) -> int:
T5Tokenizer,
),
):
# HACK(ryand): len(model) just returns the vocabulary size, so this is blatantly wrong. The tokenizer should be
# small relative to the text encoder that it's used with, so this shouldn't matter too much, but we should fix it at
# some point.
return len(model)
else:
# TODO(ryand): Promote this from a log to an exception once we are confident that we are handling all of the
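The issue flagged by the HACK comment can be illustrated with a minimal sketch. The `FakeTokenizer` class below is hypothetical, standing in for `T5Tokenizer` (whose `__len__` reports vocabulary size, not memory footprint); the byte estimate via `sys.getsizeof` is one possible rough fix, not the project's actual approach:

```python
import sys


class FakeTokenizer:
    """Hypothetical stand-in for T5Tokenizer: len() reports vocabulary size."""

    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab

    def __len__(self) -> int:
        # Mirrors transformers tokenizers: __len__ is the vocab size,
        # not the in-memory footprint in bytes.
        return len(self.vocab)


# T5's base vocabulary is 32,100 tokens.
tok = FakeTokenizer({f"token_{i}": i for i in range(32100)})

# What calc_model_size_by_data currently returns for a tokenizer:
naive_size = len(tok)  # a token count, not a byte count

# One rough byte estimate (assumption: sum container plus entry sizes):
rough_bytes = sys.getsizeof(tok.vocab) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in tok.vocab.items()
)
```

Even this crude estimate is far larger than the vocab-size value, which shows why `len(model)` undercounts; in practice the tokenizer remains small next to the text encoder, so the error's impact is limited, as the comment notes.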