CoSHAND: Controlling the World
by Sleight of Hand

Columbia University

CoSHAND synthesizes a depiction of what the scene would look like after an interaction (defined by a provided query-hand mask) has occurred.

Abstract

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform fine-grained object manipulation conditioned on actions, an important tool for world modeling and action planning.

We propose learning to model interactions through a novel form of visual conditioning: hands. Given an input image and a representation of a hand interacting with the scene, our approach, CoSHAND, synthesizes a depiction of what the scene would look like after the interaction has occurred.

We show that CoSHAND is able to recover the dynamics of manipulation by learning from large amounts of unlabeled videos of human hands interacting with objects, and by leveraging internet-scale latent diffusion model priors. CoSHAND demonstrates strong capabilities on a variety of actions and object types beyond the training dataset, and the ability to generate multiple possible futures depending on the actions performed. Our hand-conditioned model has several exciting applications in robotic planning and augmented or virtual reality.

Method

We propose a novel approach that uses hands as the control signal for manipulating objects in an image. Given an input image, its corresponding hand mask, and a query hand mask specifying the desired interaction, CoSHAND synthesizes an image with the interaction applied. Such visual conditioning allows for fine-grained manipulation, as sketched below.
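To make the conditioning interface concrete, the sketch below shows one plausible way to feed the input image and the two hand masks to a latent diffusion backbone via channel-wise concatenation. It is a minimal illustration under assumptions, not the released implementation: the function name `coshand_denoise_step` and the `unet`/`vae` objects (assumed to follow a diffusers-style API) are hypothetical.

```python
import torch
import torch.nn.functional as F

def coshand_denoise_step(unet, vae, image, hand_mask, query_hand_mask,
                         noisy_latent, timestep):
    """One denoising step with visual hand conditioning (illustrative sketch).

    `unet` and `vae` stand in for a pretrained latent-diffusion backbone;
    `hand_mask` segments the hand in the input frame and `query_hand_mask`
    describes the desired interaction.
    """
    # Encode the observed image into the autoencoder's latent space.
    image_latent = vae.encode(image).latent_dist.mode()

    # Resize both masks to latent resolution and stack them as extra channels.
    masks = F.interpolate(
        torch.cat([hand_mask, query_hand_mask], dim=1),
        size=image_latent.shape[-2:], mode="nearest",
    )

    # Channel-wise concatenation: noisy target latent + image latent + masks.
    unet_input = torch.cat([noisy_latent, image_latent, masks], dim=1)

    # Predict the noise residual; an outer scheduler loop (e.g. DDIM) would
    # iterate this step and finally decode the clean latent with `vae.decode`.
    return unet(unet_input, timestep).sample
```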

Results on Something-Something V2 (Training) Dataset

We show that CoSHAND can perform complex manipulations on a variety of rigid and deformable objects, such as squeezing a lemon, closing a drawer, rotating a bottle, and placing items inside cups. Such interactions require an understanding of deformable, articulated, and occluded objects.

Testing In-the-wild

We test CoSHAND on challenging in-the-wild examples collected in our home and lab environments. CoSHAND remains robust in these scenarios, showcasing its strong generalization ability.

Testing on Robot Arms

While CoSHAND is trained only on hands, it generalizes to robot arms for simple actions such as moving objects around, picking up objects, unfolding cloth, and sweeping granular particles.

Comparing Conditioning Methods

We show that text conditioning is insufficient for fine-grained manipulation, whereas hands allow for better control. Columns 1 & 2 show the input image, query caption, and output of text-conditional generation. Columns 3 & 4 show the input image, query hand mask, and output of CoSHAND. Column 5 shows the ground truth. Notice that CoSHAND achieves precise control (including the exact final location of the knife in row 1 and the precise squeezing motion in rows 2 & 3), which results in an output that is more consistent with the ground truth.

BibTeX


TODO: arXiv
The code for this website is adapted from Nerfies.