
- Robots still fail quickly once removed from predictable factory environments
- Microsoft Rho-alpha links language understanding directly to robotic motion control
- Tactile sensing is central to narrowing gaps between software and physical action
Robots have long performed reliably in tightly controlled industrial settings, where conditions are predictable and deviations are rare, but outside those environments they often struggle.
To address this, Microsoft has announced Rho-alpha, the first robotics model derived from its Phi vision-language series, arguing that robots need better ways to see and understand instructions.
The company believes systems can operate beyond assembly lines by responding to changing conditions rather than following rigid scripts.
What Rho-alpha is designed to do
Microsoft links this to what is increasingly called physical AI, where software models are expected to guide machines through less structured situations.
It combines language, perception, and action, reducing dependence on fixed production lines and scripted instructions.
Rho-alpha translates natural language commands into robotic control signals, and it focuses on bimanual manipulation tasks, which require coordination between two robotic arms and fine-grained control.
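For readers who think in code, the sketch below shows roughly what such a translation loop can look like: an instruction and camera frames go in, joint targets for two arms come out. Every name here (ToyVLAPolicy, BimanualAction, control_step) is a hypothetical stand-in rather than Microsoft's actual interface; only the idea of mapping language and vision to coordinated two-arm control comes from the article.

```python
# Hypothetical sketch of a VLA-style control loop. Class and method names are
# illustrative assumptions, not Microsoft's API.
from dataclasses import dataclass
import numpy as np


@dataclass
class BimanualAction:
    left_arm_joints: np.ndarray   # e.g. 7 joint targets for the left arm
    right_arm_joints: np.ndarray  # e.g. 7 joint targets for the right arm
    left_gripper: float           # 0.0 = open, 1.0 = closed
    right_gripper: float


class ToyVLAPolicy:
    """Stand-in for a vision-language-action model."""

    def predict(self, instruction: str, rgb_frames: list[np.ndarray]) -> BimanualAction:
        # A real model would encode the instruction and images and decode an
        # action chunk; here we return a zero action as a placeholder.
        return BimanualAction(
            left_arm_joints=np.zeros(7),
            right_arm_joints=np.zeros(7),
            left_gripper=0.0,
            right_gripper=0.0,
        )


def control_step(policy: ToyVLAPolicy, instruction: str, cameras: list[np.ndarray]) -> BimanualAction:
    # One tick of the loop: perceive, condition on language, emit low-level targets.
    return policy.predict(instruction, cameras)


if __name__ == "__main__":
    frames = [np.zeros((224, 224, 3), dtype=np.uint8)]  # dummy camera frame
    action = control_step(ToyVLAPolicy(), "fold the towel with both hands", frames)
    print(action.left_arm_joints.shape, action.right_arm_joints.shape)
```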
Microsoft characterizes the system as extending typical VLA approaches by expanding both perception and learning inputs.
“The emergence of vision-language-action (VLA) models for physical systems is enabling systems to perceive, reason, and act with increasing autonomy alongside humans in environments that are far less structured,” said Ashley Llorens, Corporate Vice President and Managing Director, Microsoft Research Accelerator.
Rho-alpha incorporates tactile sensing alongside vision, with additional modalities such as force sensing still under development.
These design choices suggest an attempt to narrow the gap between simulated intelligence and physical interaction, though their effectiveness remains under evaluation.
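One hedged way to picture that expanded perception input is as an observation record where vision is always present and touch or force data is attached only when the hardware provides it. The field names and shapes below are assumptions for illustration; the article only confirms that tactile input is included and force sensing is still in progress.

```python
# Illustrative observation bundle; field names are assumptions, not Rho-alpha's schema.
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray                            # H x W x 3 camera image
    tactile: np.ndarray                        # per-fingertip pressure readings
    force_torque: Optional[np.ndarray] = None  # wrist force/torque, if the hardware provides it


def flatten_observation(obs: Observation) -> np.ndarray:
    # Concatenate whichever modalities are present into one feature vector,
    # so a missing sensor contributes nothing rather than breaking the model.
    parts = [obs.rgb.astype(np.float32).ravel() / 255.0, obs.tactile.ravel()]
    if obs.force_torque is not None:
        parts.append(obs.force_torque.ravel())
    return np.concatenate(parts)


obs = Observation(rgb=np.zeros((64, 64, 3), dtype=np.uint8), tactile=np.zeros(16, dtype=np.float32))
print(flatten_observation(obs).shape)  # (12304,) = 64*64*3 + 16, without force data
```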
A central part of Microsoft’s approach relies on simulation to address limited large-scale robotics data, particularly data involving touch.
Synthetic trajectories are generated through reinforcement learning within Nvidia Isaac Sim, then combined with physical demonstrations from commercial and open datasets.
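The snippet below sketches one common way such a blend can be implemented: sampling each training batch mostly from synthetic rollouts and topping it up with the scarcer real demonstrations. The 80/20 ratio and dataset names are assumptions for illustration, not figures disclosed by Microsoft or Nvidia.

```python
# Minimal sketch of mixing simulation-generated and real trajectories in training batches.
import random


def sample_batch(sim_trajectories, real_trajectories, batch_size=32, sim_fraction=0.8):
    """Draw a mixed batch, leaning on synthetic data where real data is scarce."""
    batch = []
    for _ in range(batch_size):
        pool = sim_trajectories if random.random() < sim_fraction else real_trajectories
        batch.append(random.choice(pool))
    return batch


# Toy stand-ins: each "trajectory" would really be a sequence of observations and actions.
sim_data = [f"sim_rollout_{i}" for i in range(1000)]  # RL rollouts generated in simulation
real_data = [f"real_demo_{i}" for i in range(50)]     # scarce physical demonstrations

print(sample_batch(sim_data, real_data, batch_size=8))
```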
“Training foundation models that can reason and act requires overcoming the scarcity of diverse, real-world data,” said Deepu Talla, Vice President of Robotics and Edge AI, Nvidia.
“By leveraging NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets, Microsoft Research is accelerating the development of versatile models like Rho-alpha that can master complex manipulation tasks.”
Microsoft also emphasizes human corrective input during deployment, allowing operators to intervene using teleoperation devices and provide feedback that the system can learn from over time.
This training loop blends simulation, real-world data, and human correction, reflecting a growing reliance on AI tools to compensate for scarce embodied datasets.
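That feedback mechanism resembles the long-standing data-aggregation pattern in imitation learning, sketched below in simplified form. The buffer, function names, and intervention logic are illustrative assumptions; the article only confirms that operator corrections are recorded and learned from over time.

```python
# Hedged sketch of a deployment-time correction loop: when an operator takes over
# via teleoperation, the corrected actions are logged for later fine-tuning.
correction_buffer = []  # accumulates (observation, operator_action) pairs


def run_episode(policy_action, operator_override, observations):
    """Execute a task; whenever the operator intervenes, record the correction."""
    for obs in observations:
        action = policy_action(obs)
        override = operator_override(obs)
        if override is not None:
            correction_buffer.append((obs, override))  # learn from the human next time
            action = override
        # ... send `action` to the robot here ...
    return len(correction_buffer)


# Toy usage: the operator corrects every third step.
n = run_episode(
    policy_action=lambda o: "policy_action",
    operator_override=lambda o: "operator_action" if o % 3 == 0 else None,
    observations=list(range(9)),
)
print(f"{n} corrections queued for the next fine-tuning pass")
```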
Abhishek Gupta, Assistant Professor at the University of Washington, said, “While generating training data by teleoperating robotic systems has become a standard practice, there are many settings where teleoperation is impractical or impossible.”
“We are working with Microsoft Research to enrich pre-training datasets collected from physical robots with diverse synthetic demonstrations using a combination of simulation and reinforcement learning.”
Source: TechRadar