We present a novel method to learn Personalized Implicit Neural Avatars (PINA) from a short RGB-D sequence. This allows non-expert users to create a detailed and personalized virtual copy of themselves, which can be animated with realistic clothing deformations. PINA does not require complete scans, nor does it require a prior learned from large datasets of clothed humans. Learning a complete avatar in this setting is challenging, since only a few depth observations are available, which are noisy and incomplete (i.e., only part of the body is visible in each frame). We propose a method to learn the shape and non-rigid deformations via a pose-conditioned implicit surface and a deformation field, defined in canonical space. This allows us to fuse all partial observations into a single consistent canonical representation. Fusion is formulated as a global optimization problem over the pose, shape and skinning parameters. The method can learn neural avatars from real noisy RGB-D sequences for a diverse set of people and clothing styles, and these avatars can be animated given unseen motion sequences.
To reconstruct personalized avatars with realistic clothing deformations during animation, we propose to learn the shape and non-rigid surface deformations via a pose-conditioned implicit surface and a deformation field, defined in canonical space.
Learning an animatable avatar from a monocular RGB-D sequence is challenging since raw depth images are noisy and only contain partial views of the body. At the core of our method lies the idea of fusing partial depth maps into a single, consistent representation while simultaneously learning the articulation-driven deformations. To do so, we parametrize the 3D surface of clothed humans as a pose-conditioned implicit signed-distance field (SDF) and a learned deformation field in canonical space.
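To make this parametrization concrete, the following is a minimal sketch of what a pose-conditioned canonical SDF and a learned skinning-weight field could look like. The MLP architectures, layer sizes, pose-feature dimensionality, and the names CanonicalSDF and SkinningField are our own illustrative assumptions, not the exact networks used in the paper.

```python
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """Hypothetical pose-conditioned signed-distance field in canonical space.

    Maps a canonical 3D point together with a pose feature vector to a
    signed distance, so that surface details can change with articulation.
    """
    def __init__(self, pose_dim=69, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_canonical, pose):
        # x_canonical: (N, 3) canonical points, pose: (N, pose_dim) pose features
        return self.net(torch.cat([x_canonical, pose], dim=-1))


class SkinningField(nn.Module):
    """Hypothetical learned skinning-weight field in canonical space.

    Maps a canonical point to blend weights over the skeleton bones,
    which drive the articulation of the implicit surface.
    """
    def __init__(self, num_bones=24, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, num_bones),
        )

    def forward(self, x_canonical):
        # Softmax keeps the weights positive and summing to one per point.
        return torch.softmax(self.net(x_canonical), dim=-1)
```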
Training is formulated as a global optimization that jointly optimizes the per-frame poses, the shape and the skinning field, without requiring prior knowledge extracted from large datasets. Based on our model parametrization, we transform the canonical surface points and the spatial gradient into posed space, enabling supervision via the input point cloud and its normals.
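As an illustration of this supervision, the sketch below shows how canonical points and their normals (SDF gradients) might be skinned into posed space via linear blend skinning and compared against an observed depth point cloud. The function names, the assumption of known correspondences between model and observed points, and the loss weighting are simplifying assumptions for exposition, not the paper's exact formulation.

```python
import torch

def lbs_transform(x_canonical, weights, bone_transforms):
    """Linear blend skinning of canonical points into posed space.

    x_canonical:     (N, 3) points in canonical space
    weights:         (N, B) skinning weights (rows sum to one)
    bone_transforms: (B, 4, 4) per-bone rigid transforms for the current pose
    """
    # Blend the per-bone transforms with the skinning weights: (N, 4, 4)
    T = torch.einsum('nb,bij->nij', weights, bone_transforms)
    x_h = torch.cat([x_canonical, torch.ones_like(x_canonical[:, :1])], dim=-1)
    x_posed = torch.einsum('nij,nj->ni', T, x_h)[:, :3]
    return x_posed, T


def posed_normals(n_canonical, T):
    """Rotate canonical surface normals (SDF spatial gradients) into posed space."""
    n_posed = torch.einsum('nij,nj->ni', T[:, :3, :3], n_canonical)
    return torch.nn.functional.normalize(n_posed, dim=-1)


def fitting_loss(x_posed, n_posed, x_obs, n_obs, w_normal=0.1):
    """Illustrative data term: point distance plus normal agreement between
    the skinned surface samples and the observed depth points/normals."""
    loss_point = (x_posed - x_obs).norm(dim=-1).mean()
    loss_normal = (1.0 - (n_posed * n_obs).sum(dim=-1)).mean()
    return loss_point + w_normal * loss_normal
```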
Here, we show some of our results; all avatars have been learned from real-world RGB-D data.
Without learned skinning weights, the deformed regions can be noisy and display visible artifacts.
Without pose-dependent features, the shape network cannot represent dynamically changing surface details.