ISBN (Print): 9798350353006
Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited set of categories within a simulator, often struggles to generalize, especially when confronted with a broad range of categories. Therefore, we introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of multimodal large language models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning only the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLM while equipping it with the ability to manipulate. The fundamental insight lies in the introduced fine-tuning paradigm, encompassing object category understanding, affordance prior reasoning, and object-centric pose prediction, which stimulates the reasoning ability of the MLLM in manipulation. During inference, our approach uses an RGB image and a text prompt to predict the end effector's pose in a chain-of-thought manner. After the initial contact is established, an active impedance adaptation policy is introduced to plan the upcoming waypoints in a closed loop. Moreover, for the real world, we design a test-time adaptation (TTA) strategy that enables the model to better adapt to the current real-world scene configuration. Experiments in simulation and in the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://***/view/manipllm.
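The chain-of-thought inference described in the abstract can be pictured as a sequence of queries mirroring the fine-tuning paradigm: category understanding, then affordance reasoning, then pose prediction. Below is a minimal Python sketch of that flow; it assumes the MLLM is exposed as a hypothetical callable `(image, prompt) -> text`, and the prompt wording and numeric output format are illustrative assumptions, not the paper's exact interface.

```python
import re
import numpy as np

def predict_end_effector_pose(mllm, rgb_image, task_prompt):
    """Chain-of-thought pose prediction: category -> affordance -> pose.

    `mllm` is a hypothetical callable (image, prompt) -> text.
    """
    # Step 1: object category understanding.
    category = mllm(rgb_image, "What is the object category?")
    # Step 2: affordance prior reasoning conditioned on the category.
    affordance = mllm(
        rgb_image,
        f"For the {category}, which region affords the task '{task_prompt}'?")
    # Step 3: object-centric pose prediction, asking for a parseable answer.
    pose_text = mllm(
        rgb_image,
        f"Given the affordance region ({affordance}), output the contact "
        "pixel (u, v) and the gripper approach direction (dx, dy, dz) "
        "as five numbers.")
    # Parse the first five numbers out of the model's free-form answer.
    nums = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", pose_text)]
    contact_uv = np.array(nums[:2])
    direction = np.array(nums[2:5])
    return contact_uv, direction / np.linalg.norm(direction)
```

After this initial pose is executed, the abstract's active impedance adaptation would take over, reading force feedback and planning subsequent waypoints in a closed loop; that controller is hardware-specific and is not sketched here.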
ISBN (Print): 9798350353006
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it on various datasets for evaluation. We obtain substantial training data via multi-modal large language models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained through a multi-turn dialogue with a large language model (LLM). This allows us to build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond to the image. This method is based on the similarity between each word and all patch token embeddings in the image. We then mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost direct-transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
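The noisy-word identification step can be illustrated concretely: score each word by its maximum cosine similarity against the image's patch embeddings, then assign a higher masking probability to low-scoring words for the next epoch. The PyTorch sketch below follows that idea; the mean-based threshold and the probabilities `base_p` and `noisy_p` are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def noisy_word_mask_probs(word_embs, patch_embs, base_p=0.15, noisy_p=0.5):
    """
    word_embs:  (num_words, dim)   token embeddings of one caption
    patch_embs: (num_patches, dim) patch token embeddings of the paired image
    Returns per-word masking probabilities for the next training epoch.
    """
    w = F.normalize(word_embs, dim=-1)
    p = F.normalize(patch_embs, dim=-1)
    sim = w @ p.T                       # cosine similarity, (words, patches)
    word_score = sim.max(dim=1).values  # best-matching patch for each word
    # Words whose best patch similarity falls below the caption mean are
    # treated as likely noise and masked with a higher probability.
    noisy = word_score < word_score.mean()
    return torch.where(noisy,
                       torch.full_like(word_score, noisy_p),
                       torch.full_like(word_score, base_p))

# Example: sample which words to replace with [MASK] in the next epoch.
probs = noisy_word_mask_probs(torch.randn(12, 256), torch.randn(196, 256))
mask = torch.bernoulli(probs).bool()
```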
Mathematical reasoning remains an ongoing challenge for AI models, especially for geometry problems, which require both linguistic and visual signals. As the vision encoders of most MLLMs are trained on natural scenes...
In 1997, Sony announced AIBO, a fully autonomous small quadruped robot for home entertainment, and in 1999 the company began selling it as a consumer product. Soon after AIBO's development, two small humanoid robots were announced. One was QRIO by Sony, a roughly 60 cm tall body with dynamic bipedal walking and whole-body cooperative control. The other was PINO by the ERATO Kitano Symbiotic Systems Project, a roughly 70 cm tall body with open hardware and software built from inexpensive off-the-shelf components. In this paper, we revisit the two humanoids and their nearly 20-year-old technologies, and discuss what was done 20 years ago, what has been achieved since, and what challenges lie ahead.