Do you have more info on video encoding process? You write: >We created a model ...

Do you have more info on video encoding process?

You write:

>We created a model without this tradeoff by training our video encoder on a masked compression objective

And I understand why this would give you more detail per token, but how are you reducing total number of tokens?