You write:
>We created a model without this tradeoff by training our video encoder on a masked compression objective
And I understand why this would give you more detail per token, but how are you reducing total number of tokens?
You write:
>We created a model without this tradeoff by training our video encoder on a masked compression objective
And I understand why this would give you more detail per token, but how are you reducing total number of tokens?