Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While today's generative models have shown impressive results on generating and editing images unconditionally or conditioned on text, current methods do not support fine-grained object manipulation conditioned on actions, an important capability for world modeling and action planning.
We propose learning to model interactions through a novel form of visual conditioning: hands. Given an input image and a representation of a hand interacting with the scene, our approach, CoSHAND, synthesizes a depiction of what the scene would look like after the interaction has occurred.
We show that CoSHAND is able to recover the dynamics of manipulation by learning from large amounts of unlabeled videos of human hands interacting with objects, and by leveraging internet-scale latent diffusion model priors. CoSHAND generalizes to a variety of actions and object types beyond the training dataset, and can generate multiple possible futures depending on the actions performed. Our hand-conditioned model has several exciting applications in robotic planning and augmented or virtual reality.