SigLoMa: a quadruped robot that learns manipulation in the real world using vision alone

LLM & Models 🟢 Beginner ⏱️ 14 min read 📅 2026-05-06

🔎 Robotics is finally leaving the lab

For years, mobile robotics research has hit the same wall: everything works in simulation, and everything breaks in the real world. The well-known sim-to-real gap, the discrepancy between the idealized physics of a simulator and the chaos of a real environment, has slowed the deployment of useful robots.

SigLoMa breaks this cycle. This system, detailed in the paper SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision (May 2025), teaches a quadruped to move and manipulate objects in real spaces, without ever going through a simulation.

The approach relies entirely on egocentric vision: a single camera mounted on the robot. No motion capture, no environment calibration, no pre-existing 3D models. The robot observes, acts, and learns from its own interactions.

This result is well timed. As the race for humanoid robots gains momentum, the industry is realizing that locomotion alone is no longer enough. A robot that can walk but cannot open a door or grab a package remains a lab gadget. SigLoMa tackles exactly this gap: merging movement and manipulation on a quadruped platform.

The essentials

SigLoMa learns loco-manipulation (walking + grasping) in the real world, without prior simulation.
The system uses only the egocentric vision of an onboard camera, without external sensors or motion capture.
The method tackles the sample efficiency problem: the robot learns from a few hours of real-world data, not from thousands of hours.
The results are validated on the Unitree Go1 quadruped robot in unprepared domestic environments.
It is a rare success in the open world: the robot generalizes to objects and places never seen during training.

Why loco-manipulation is the Holy Grail of mobile robotics

A robot must do two things: go somewhere and do something useful there. The community has treated these problems separately for decades.

Quadruped locomotion exploded in 2020-2022 with work like that from ETH Zurich and MIT. Robots walk, run, get up after a fall. Impressive on video, but limited in practice: a robot dog that crosses a room without interacting with anything has no industrial utility.

Manipulation, on the other hand, has progressed via robotic arms mounted on fixed bases. A fixed arm grasps, sorts, assembles. But it is trapped on its base. It doesn't go get the object: it waits for it to be brought to it.

Loco-manipulation merges the two. The robot moves to an object, grabs it, carries it. This is what turns a demonstration robot into a utility machine. It is also what Boston Dynamics aims for with Atlas: a humanoid that walks in a workshop and manipulates parts.

On a quadruped, the challenge is compounded. The whole body vibrates while walking. The head (and therefore the camera) oscillates constantly. Grasping an object with a gripper mounted on a moving body is like trying to thread a needle while running on a treadmill.

SigLoMa shows that it's possible, and with disconcerting elegance: a single camera, no other sensor, no software safety net.

The sim-to-real gap: why everyone suffers from it

The dominant approach in learning robotics is called sim-to-real. The control policy is trained in a physics simulator (MuJoCo, Isaac Sim, PyBullet), then transferred to the real robot.

The problem? The real world is unforgiving.

A simulator models gravity, friction, and the mass of objects with fixed parameters. In reality, a table's friction varies according to humidity, temperature, and surface wear. An object slides differently from one day to the next. The robot's motors have mechanical play, calibration drifts, and unmodeled elasticities.

To bridge this gap, researchers use domain randomization techniques: they randomly vary physical parameters in simulation (mass, friction, latency) to force the policy to be robust. This works partially for pure locomotion. For fine manipulation, it's catastrophic.

Manipulation demands millimetric precision that randomization destroys. Too robust, the policy becomes coarse and misses grasps. Not robust enough, it fails at the first change in real conditions.
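To make the idea of domain randomization concrete, here is a minimal sketch of how one set of physical parameters might be drawn per simulated episode. The class name, the parameter names and the numeric ranges are all hypothetical and chosen only for illustration; they do not come from the paper or from any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    """Physical parameters a simulator would hold fixed for one episode."""
    object_mass_kg: float
    friction_coeff: float
    actuation_latency_s: float


# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "object_mass_kg": (0.05, 1.0),
    "friction_coeff": (0.3, 1.2),
    "actuation_latency_s": (0.0, 0.04),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw each parameter uniformly from its range."""
    return PhysicsParams(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    )


rng = random.Random(0)
# One fresh draw per training episode forces the policy to cope with all of them.
episodes = [sample_physics(rng) for _ in range(5)]
for params in episodes:
    print(params)
```

Widening the ranges makes the policy more tolerant but less precise, which is the trade-off described above for fine manipulation.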

SigLoMa completely bypasses the problem. No simulator, no gap to bridge. The robot learns directly in the real world, with all its complexity and imperfections. The cost lies elsewhere: sample efficiency.

Egocentric vision as the sole sensor

Most loco-manipulation approaches use an arsenal of sensors: LiDAR for 3D mapping, depth cameras (RGB-D) for object perception, IMU for orientation, joint encoders for proprioception.

SigLoMa uses a standard RGB camera. Only.

Egocentric vision — what the robot sees from its own point of view — is processed by a vision network that directly extracts the information necessary for action. No intermediate 3D reconstruction step, no SLAM, no classic object detection like YOLO.

This approach echoes recent advances in image analysis with vision-capable LLMs: instead of breaking perception down into sub-tasks (detection, segmentation, depth estimation), an end-to-end network learns the direct mapping from pixels to motor commands.

The advantage is considerable. By removing the classic perception pipeline, SigLoMa eliminates calibration errors between sensors. The camera is fixed on the robot, it moves with it — motion artifacts become information, not noise.

The network implicitly learns to compensate for walking oscillations. It understands that the object moving in the field of vision does so partly because the robot is walking, and it adjusts its commands accordingly. This is embodied learning in its purest form.

The SigLoMa architecture: how it actually works

SigLoMa relies on two main modules: a vision encoder and a control policy.

The vision encoder

Images from the egocentric camera pass through a pre-trained convolutional network (ResNet type). This encoder extracts compact visual features that capture both the surrounding scene and the robot's state (its visible legs in the field of view, for example).

The key trick: the encoder also integrates the recent history of images, not just the current one. This short temporal memory allows the system to infer dynamics — how the scene evolves, at what speed the robot is moving, in which direction.
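A short image history of this kind is often implemented as a fixed-length rolling buffer. The sketch below shows that idea only; the class name `FrameHistory`, the history length of 4 and the padding rule are assumptions for illustration, not details taken from the paper. Strings stand in for camera images.

```python
from collections import deque


class FrameHistory:
    """Keeps the most recent `length` frames and returns them oldest-first."""

    def __init__(self, length: int):
        self.frames = deque(maxlen=length)

    def push(self, frame):
        if not self.frames:
            # On the very first frame, repeat it so the encoder always
            # receives exactly `length` frames.
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            # The deque drops the oldest frame automatically when full.
            self.frames.append(frame)
        return list(self.frames)


history = FrameHistory(length=4)
for t in range(6):
    stacked = history.push(f"frame_{t}")

print(stacked)  # ['frame_2', 'frame_3', 'frame_4', 'frame_5']
```

Because consecutive frames differ mostly through the robot's own motion, a stack like this gives the network enough information to estimate speed and direction without a dedicated sensor.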

The control policy

The visual features are concatenated with the task commands (the goal: "move toward the chair and grab the cardboard box") and passed into an MLP (Multi-Layer Perceptron) network that directly outputs the motor torques for the 12 joints of the quadruped plus the gripper command.

No trajectory planner. No split between a locomotion module and a manipulation module. A single network makes a single decision at each timestep: what torque to apply to each motor. The fusion of locomotion and manipulation emerges from training.
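The data flow described above (concatenate visual features with the task command, pass them through an MLP, read out 12 joint torques plus one gripper command) can be sketched in a few lines of plain Python. Only the 12 joints and the gripper output come from the article. The feature size, command size, hidden size, random weights, single hidden layer and tanh activation are hypothetical placeholders, and a real system would use a deep learning framework with trained weights.

```python
import math
import random

NUM_JOINTS = 12                # from the article: 12 quadruped joints
ACTION_DIM = NUM_JOINTS + 1    # plus one gripper command
FEATURE_DIM = 8                # hypothetical size of the visual feature vector
COMMAND_DIM = 4                # hypothetical size of the encoded task command
HIDDEN_DIM = 16                # hypothetical hidden layer width


def make_layer(rng, n_in, n_out):
    """Create (weights, biases) for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(layer, x):
    """Apply y = W x + b."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def policy(layers, visual_features, task_command):
    """One forward pass: concatenate inputs, hidden layer, action output."""
    x = list(visual_features) + list(task_command)
    hidden = [math.tanh(v) for v in linear(layers[0], x)]
    return linear(layers[1], hidden)


rng = random.Random(0)
layers = [
    make_layer(rng, FEATURE_DIM + COMMAND_DIM, HIDDEN_DIM),
    make_layer(rng, HIDDEN_DIM, ACTION_DIM),
]

action = policy(layers, [0.0] * FEATURE_DIM, [1.0, 0.0, 0.0, 0.0])
torques, gripper = action[:NUM_JOINTS], action[NUM_JOINTS]
print(len(torques), gripper)
```

The point of the sketch is the shape of the interface: one input vector and one output vector per timestep, with no separate planner between them.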

The challenge of sample efficiency in the real world

Training in the real world poses a brutal problem: the robot wears out. Every fall, every collision, every hour of exploration consumes hardware. A robot that requires 100,000 training episodes to learn a task is unusable in practice — it will be in pieces before it has learned anything.

This is the problem of sample efficiency: how many real experiences are needed to reach competent behavior?

SigLoMa solves this problem through a combination of three techniques.

Imitation learning from a teleoperated expert

Rather than starting from scratch (pure reinforcement learning), researchers first provide demonstrations. A human teleoperates the robot to perform the target task a few dozen times. The policy first learns to imitate, then refines itself.
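The simplest form of imitation learning is behavior cloning: fit the policy to the expert's (observation, action) pairs by minimizing the squared error. The article does not say which imitation method SigLoMa uses, so the sketch below is a generic illustration on a toy 1-D problem. The demonstration data, the linear policy, the learning rate and the epoch count are all made up for the example.

```python
# Hypothetical 1-D demonstrations: (observation, expert action).
# They were generated from action = 0.5 * observation + 0.1.
demos = [(0.0, 0.1), (1.0, 0.6), (2.0, 1.1), (3.0, 1.6)]


def behavior_cloning(demos, lr=0.05, epochs=2000):
    """Fit action = w * obs + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        # Gradients of mean((w * o + b - a) ** 2) with respect to w and b.
        grad_w = sum(2 * (w * o + b - a) * o for o, a in demos) / n
        grad_b = sum(2 * (w * o + b - a) for o, a in demos) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = behavior_cloning(demos)
print(round(w, 3), round(b, 3))
```

On this toy data the fit should recover values close to the generating slope 0.5 and offset 0.1. A policy obtained this way only knows what the expert showed it, which is why the refinement stage matters.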

Filtering low-quality data

Not all demonstrations are equal. Some are hesitant, others miss the target. SigLoMa integrates a scoring mechanism that filters out low-quality trajectories before training, so that only the highest-scoring demonstrations are kept.
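The article does not describe how the scoring works, so the following is only a sketch of what a score-and-threshold filter could look like. The `Demo` fields, the scoring rule (failed demonstrations score zero, slower ones score lower) and the threshold of 0.5 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    name: str
    success: bool      # did the teleoperated episode reach the goal?
    duration_s: float  # hesitant demonstrations tend to take longer


def score(demo: Demo, max_duration_s: float = 60.0) -> float:
    """Hypothetical score in [0, 1]: failures get 0, faster successes score higher."""
    if not demo.success:
        return 0.0
    return max(0.0, 1.0 - demo.duration_s / max_duration_s)


def filter_demos(demos, threshold=0.5):
    """Keep only demonstrations whose score reaches the threshold."""
    return [d for d in demos if score(d) >= threshold]


demos = [
    Demo("clean_grasp", True, 18.0),
    Demo("hesitant_grasp", True, 45.0),
    Demo("missed_target", False, 20.0),
]
kept = filter_demos(demos)
print([d.name for d in kept])  # ['clean_grasp']
```

With so few demonstrations available, the threshold is a real trade-off: too strict and there is not enough data left, too lenient and the policy imitates the hesitations.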

Online adaptation

Once deployed, the robot continues to learn from its own experiences. Successful episodes reinforce the policy, while failures are recorded as counter-examples. This online adaptation allows progressive generalization to new environments without returning to the lab.
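One plausible way to organize this kind of online data is a bounded buffer that labels each deployment episode by its outcome. The sketch below assumes that design; the class name, the +1.0/-1.0 labels and the capacity are illustrative choices rather than details from the paper, and the policy update that would consume the sampled batch is left out.

```python
import random
from collections import deque


class EpisodeBuffer:
    """Bounded store of deployment episodes, labelled by outcome."""

    def __init__(self, capacity: int):
        # When the buffer is full, the oldest episode is dropped first.
        self.episodes = deque(maxlen=capacity)

    def add(self, trajectory, success: bool):
        # +1.0 marks an episode to reinforce, -1.0 marks a counter-example.
        self.episodes.append((trajectory, 1.0 if success else -1.0))

    def sample(self, rng: random.Random, batch_size: int):
        """Draw a batch without replacement, capped at the buffer size."""
        k = min(batch_size, len(self.episodes))
        return rng.sample(list(self.episodes), k)


buffer = EpisodeBuffer(capacity=3)
for i, success in enumerate([True, False, True, True]):
    buffer.add(f"episode_{i}", success)

batch = buffer.sample(random.Random(0), batch_size=2)
print(len(buffer.episodes), batch)
```

Keeping the buffer bounded means recent experience in the current environment gradually replaces older data, which is one simple way to get adaptation without retraining from scratch.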

The result: the robot achieves competent behavior after about 40 to 60 demonstration episodes, equivalent to a few hours of data collection. This is an order of magnitude lower than real-world RL-from-scratch approaches.

Results: what the robot actually does

The experiments are conducted on the Unitree Go1, a consumer-grade quadruped equipped with a 2-finger gripper mounted on its torso. The environment is a university office — not a prepared lab, not a perfectly flat surface.

Validated tasks

The robot accomplishes four loco-manipulation tasks:

Ground object grasping: the robot walks up to an object placed on the ground (cardboard, bottle, box), stops, adjusts its posture, and grasps the object with its gripper.
Tabletop object grasping: the robot approaches a table, rears up on its hind legs to reach the height, and grasps the object.
Object transport: after grasping, the robot walks toward a target point while holding the object in its gripper, adapting to ground irregularities.
Door opening: the robot approaches a door, inserts its gripper into the handle, and pushes while moving laterally.

Quantitative performance

The success rate ranges from 60% to 85% depending on the task and environment. These figures may seem modest compared to a fixed robotic arm (which achieves 95%+). But for open-world quadruped loco-manipulation, this is an unprecedented result.

Failures primarily stem from extreme situations: an object too heavy for the Go1's gripper, a surface too slippery for locomotion, or total occlusion of the object during the approach.

Open-world generalization

The most impressive point is the generalization. The robot is trained in a specific office with a set of objects. Tested in a different corridor with never-before-seen objects (a shoe, a plastic cup), it succeeds without retraining.

The policy has learned sufficiently generic manipulation primitives to transfer to new contexts. This is exactly what specialized approaches were missing.

Comparison with the state of the art

Approach	Simulation required?	Sensors	Fused loco-manipulation?	Open-world generalization?	Typical success rate
Classic sim-to-real (ETH, 2023)	Yes	IMU + LiDAR + RGB-D	Partial	No	40-55%
Real RL from scratch (UC Berkeley, 2024)	No	IMU + encoders	No (locomotion only)	No	70-80% (loco)
Modular approach (separate loco + arm, 2024)	Yes	Multi-sensors	No (sequential)	Limited	50-65%
SigLoMa (2025)	No	1 RGB camera	Yes	Yes	60-85%

The table speaks for itself. SigLoMa is the only approach that checks all the boxes: no simulation, minimal sensing, true fusion of locomotion and manipulation, and open-world generalization.

The system's honest limitations

Despite its lead, SigLoMa has weaknesses that the paper does not hide.

The Go1's gripper is rudimentary: two parallel fingers, no fine grasping. The robot can only grasp objects of compatible shape and size, so a pen or a spoon is out of reach. Moving to a 5-finger anthropomorphic hand is not trivial and would require a complete redesign of the policy.

The manipulation remains simple prehensile grasping. No assembly, no tool use, no interaction with complex mechanisms. Opening a door by pushing a handle is one thing; turning a key in a lock is another.

Robustness to external disturbances remains to be tested. What happens if someone pushes the robot while it is carrying an object? The paper does not document this scenario.

Finally, sample efficiency, although improved, remains a limiting factor for commercial deployment. Forty to sixty demonstration episodes per task is reasonable for research. It is still too much for an end user who wants a robot that works out of the box.

What this implies for the AI Skills system

SigLoMa is not just a robotics result. It is a physical illustration of a concept that runs through all of AI: the learning of composable skills.

In software AI, the Skills system allows an agent to acquire specific capabilities (summarizing a document, searching the web, generating code) and combine them to solve complex tasks. The agent learns a skill, stores it, reuses it in a new context.

SigLoMa does the same thing in the physical world. The policy learns primitives — "move toward an object", "grasp", "transport" — which compose to achieve varied tasks. Open-world generalization is exactly the robotic equivalent of skill transfer between contexts.
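The notion of composable skills can be illustrated with a toy example in which each skill is a function from state to state and a task is a sequence of skills. This is a sketch of the concept only. SigLoMa is described as a single end-to-end network, so these explicit functions, the state keys and the location names are hypothetical stand-ins, not the system's internal structure.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Skill = Callable[[State], State]


def move_to_object(state: State) -> State:
    """Walk to wherever the object currently is."""
    return {**state, "robot_at": state["object_at"]}


def grasp(state: State) -> State:
    """Grasping only succeeds when the robot is next to the object."""
    return {**state, "holding": state["robot_at"] == state["object_at"]}


def transport(state: State) -> State:
    """Carry the object to the target, if it is being held."""
    if not state["holding"]:
        return state
    return {**state, "robot_at": state["target"], "object_at": state["target"]}


def run(skills: List[Skill], state: State) -> State:
    """Apply the skills in order, feeding each one the previous result."""
    for skill in skills:
        state = skill(state)
    return state


start = {"robot_at": "door", "object_at": "chair", "target": "desk", "holding": False}
final = run([move_to_object, grasp, transport], start)
print(final)
```

The same three functions work unchanged for any start, object and target locations, which is the sense in which a skill learned once can be reused in a new context.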

The convergence is striking. Software agents are becoming embodied (they act in interfaces, APIs, real environments). Robots are becoming agents (they plan, compose skills, adapt). The boundary between software AI and hardware AI is blurring.

❌ Common mistakes

Mistake 1: Confusing locomotion with loco-manipulation

What people do: they watch a video of SigLoMa and say "but this already exists, quadrupeds have known how to walk since 2020".

What's wrong: locomotion is pure movement. Loco-manipulation is the ability to physically interact with objects during or at the end of the movement. The difference is as big as between knowing how to drive and knowing how to deliver a package while driving.

The solution: evaluate robotic systems on what they do in the environment, not just on their ability to traverse it.

Mistake 2: Underestimating the cost of sim-to-real

What people do: they assume that because a simulator is "realistic", the transfer to a real robot is trivial.

What's wrong: even the most advanced simulators (NVIDIA's Isaac Sim) introduce systematic biases in fine manipulation. Static vs dynamic friction, elastic deformations of objects, real communication latencies — no simulator captures all of this faithfully.

The solution: take sim-to-real results with a grain of salt. A 90% success rate in simulation often translates to 50% on the robot. SigLoMa avoids this problem entirely.

Mistake 3: Judging robotic manipulation with industrial standards

What people do: they compare SigLoMa's success rate (60-85%) to that of a KUKA arm in a factory (99%+).

What's wrong: the KUKA arm is in a controlled environment, fixed to the ground, with calibrated objects and a deterministic program. SigLoMa operates in the open world, on an unstable mobile platform, with a single camera.

The solution: compare what is comparable. SigLoMa's metrics must be judged against other open-world loco-manipulation approaches, not against industrial arms in an automated cell.

❓ Frequently Asked Questions

Does SigLoMa work with any quadruped?

Not exactly. The policy is trained for the specific morphology of the Unitree Go1. Adapting it to another robot requires at least partial retraining, or transfer from a robot with a similar morphology. The general principle remains valid, but the system is not plug-and-play.

Why not use an RGB-D depth camera?

An RGB-D camera would provide useful geometric information, but it adds weight to the system, consumes more energy, and performs poorly outdoors (sunlight disrupts infrared sensors). The strength of SigLoMa is showing that depth is not necessary — the network infers it implicitly from motion and parallax.

What is the connection with humanoid robots like Figure or Atlas?

SigLoMa demonstrates the fundamental principles of loco-manipulation on a quadruped platform, which is simpler and less expensive. These principles (egocentric vision, loco-manipulation fusion, real-world learning) are directly transferable to humanoids, which are the ultimate industrial target.

How long does it take to deploy SigLoMa on a new task?

With an experienced operator, collecting demonstrations takes 1 to 2 hours. Training the policy takes a few additional hours on a GPU. It is therefore possible to go from a new task to a functional robot in half a day, which is exceptional for real-world learning.

✅ Conclusion

SigLoMa demonstrates that a quadruped robot can learn to move and manipulate objects in the real world with only a camera, without ever going through simulation. This is proof that the sim-to-real gap is not inevitable — it is a problem that can simply be bypassed.

The message for the industry is clear: stop building ever more complex simulators and invest in directly embodied learning methods. The utility robots that come out of the labs will not be the ones that best simulate the world; they will be the ones that don't need to.
