This paper proposes a novel framework for segmenting hand gestures in RGB-depth (RGB-D) images captured by a Kinect sensor, using humanlike approaches for human–robot interaction. The goal is to reduce Kinect sensing errors and, consequently, to improve the precision of hand gesture segmentation for the NAO robot. The framework comprises two main novel approaches. First, the depth map and the RGB image are aligned by using a genetic algorithm to estimate key points; this alignment is robust to uncertainty in the number of extracted points. Second, the edges of the tracked hand gestures in the RGB images are refined by a modified expectation–maximization (EM) algorithm based on Bayesian networks. The experimental results demonstrate that the proposed alignment method precisely matches the depth maps with the RGB images and that the EM algorithm effectively refines the RGB edges of the segmented hand gestures. The framework has been integrated into and validated in a human–robot interaction system, where it improves the NAO robot's ability to understand and interpret hand gestures.
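The abstract does not specify how the genetic algorithm is configured. The sketch below shows one plausible form of GA-based depth-to-RGB alignment. It assumes a four-parameter scale-and-translation mapping (sx, sy, tx, ty) from depth-map key points to RGB key points and scores each candidate by the mean distance from every mapped depth key point to its nearest RGB key point. Because nearest-neighbour scoring needs no one-to-one correspondences, the two key-point sets may differ in size, which loosely mirrors the robustness to point-count uncertainty described in the abstract. The mapping model, the cost, the function names `alignment_cost` and `ga_align`, and all GA settings (binary tournament selection, BLX-0.5 crossover, Gaussian mutation, elitism) are hypothetical choices for illustration, not the paper's method.

```python
import numpy as np

# Hypothetical sketch: the scale-plus-translation model (sx, sy, tx, ty), the
# nearest-neighbour cost, and the GA settings are assumptions, not the paper's method.


def alignment_cost(params, depth_pts, rgb_pts):
    """Mean distance from each mapped depth key point to its nearest RGB key point.

    No one-to-one correspondences are needed, so the two sets may differ in size.
    """
    sx, sy, tx, ty = params
    mapped = depth_pts * np.array([sx, sy]) + np.array([tx, ty])
    # Pairwise distances, shape (n_depth, n_rgb).
    dists = np.linalg.norm(mapped[:, None, :] - rgb_pts[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


def ga_align(depth_pts, rgb_pts, lo, hi, pop_size=80, generations=200, seed=0):
    """Search for (sx, sy, tx, ty) within [lo, hi] that minimises alignment_cost."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    pop = rng.uniform(lo, hi, size=(pop_size, lo.size))
    costs = np.array([alignment_cost(p, depth_pts, rgb_pts) for p in pop])

    def tournament():
        # Binary tournament: the fitter (lower-cost) of two random individuals wins.
        i, j = rng.integers(pop_size, size=2)
        return pop[i] if costs[i] < costs[j] else pop[j]

    for g in range(generations):
        # Mutation scale shrinks over time: explore early, fine-tune late.
        sigma = (hi - lo) * (0.1 * (1.0 - g / generations) + 0.002)
        children = [pop[np.argmin(costs)].copy()]  # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = tournament(), tournament()
            # BLX-0.5 crossover: sample between and slightly beyond the parents.
            span = np.abs(a - b)
            child = rng.uniform(np.minimum(a, b) - 0.5 * span, np.maximum(a, b) + 0.5 * span)
            child = child + rng.normal(0.0, sigma)  # Gaussian mutation per gene
            children.append(np.clip(child, lo, hi))
        pop = np.array(children)
        costs = np.array([alignment_cost(p, depth_pts, rgb_pts) for p in pop])

    best = int(np.argmin(costs))
    return pop[best], float(costs[best])


if __name__ == "__main__":
    # Synthetic check: 30 RGB key points, but only 25 depth key points.
    rng = np.random.default_rng(1)
    rgb = rng.uniform([0, 0], [640, 480], size=(30, 2))
    true_params = np.array([1.1, 0.95, 12.0, -8.0])
    depth = (rgb[:25] - true_params[2:]) / true_params[:2]  # invert the mapping
    params, cost = ga_align(depth, rgb, lo=[1.0, 0.85, 0.0, -20.0], hi=[1.2, 1.05, 20.0, 0.0])
    print("estimated (sx, sy, tx, ty):", params, "mean residual (px):", cost)
```

Narrow parameter bounds, for example derived from the known Kinect camera geometry, shrink the search space considerably. If spurious depth key points are expected, truncating each nearest-neighbour distance at a threshold would keep such outliers from dominating the cost.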