I saw in the scripts that you already have support for CoCa models. It seems like a natural extension to have it generate a caption while also exposing, via the cross-attention maps, which regions of the image contributed to which generated tokens. Is this on the radar, or a method that could be integrated?
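To make the request concrete, here is a toy NumPy sketch of the bookkeeping being asked for: at each decoding step, record the decoder's cross-attention distribution over image patches alongside the emitted token. All weights, shapes, and names here are hypothetical stand-ins, not open_clip's actual CoCa implementation; in the real model one would instead capture the cross-attention weights of the multimodal decoder (e.g. via forward hooks) during `generate`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # toy embedding dim (assumption)
n_patches = 49  # e.g. a 7x7 grid of image patches (assumption)
vocab = 10      # toy vocabulary size (assumption)

# Hypothetical stand-ins for a CoCa-style setup: patch embeddings from the
# image encoder, a token embedding table, and an output head. Random weights;
# this only illustrates how attention maps could be collected per token.
patches = rng.normal(size=(n_patches, d))
tok_emb = rng.normal(size=(vocab, d))
W_out = rng.normal(size=(d, vocab))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(query):
    """One greedy decoding step: cross-attend the query to the image
    patches, return the next token id and the attention map over patches."""
    attn = softmax(query @ patches.T / np.sqrt(d))  # (n_patches,)
    ctx = attn @ patches                            # attended image context
    logits = ctx @ W_out
    return int(logits.argmax()), attn

token = 0  # BOS
caption, maps = [], []
for _ in range(5):
    token, attn = decode_step(tok_emb[token])
    caption.append(token)
    maps.append(attn)  # per-token patch attention, reshapeable to 7x7

maps = np.stack(maps)  # (caption_length, n_patches)
```

Each row of `maps` is a normalized distribution over patches for one generated token, which could be reshaped to the patch grid and overlaid on the image as a heatmap.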