Research plan · updated 2026-07-02 · for readers familiar with the concepts of motion generation, diffusion models, and reinforcement learning

Effect-Grounded Motion (EGM)

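Build a model that, given the instruction "open the door" and the 3D shape of the room, generates human motion that actually opens the door. The key idea is to replay the generated motion inside a physics simulator and use the change it causes in the world (the effect) as the supervision signal for training the generator itself. Target: CVPR 2027 (deadline mid-November 2026).

Closed loop demonstrated (success rate 0% → 75% → 100%) · Now: working on generalization to unseen scenes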

Goal

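What we are after: motion that works, not motion that merely looks good

Fig 1 · The end state. With no robot controller in between, the generator itself directly outputs motion that accomplishes the task. A good motion generator can be used as-is downstream: animation, training data for humanoids, VR.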

State of the art

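What prior work can and cannot do

✓ Solved · Natural motion from text: Human-like motion can be generated from prompts such as "walk" or "sit" (the text-to-motion diffusion models since MDM). NVIDIA's Kimodo (our base) is the latest publicly released one.

✓ Solved · Geometric fit to the scene: Shape-level consistency such as not clipping into walls, feet touching the floor, and sitting near objects (SceMoS, LINGO, etc.).

✗ Open · Whether the task was actually accomplished: Nobody measures whether the door opened or the object reached its destination, and nobody uses it as a training signal. Evaluation stops at visual naturalness (FID etc.) or geometric consistency.

Fig 2 · Where the field stands. The third column is the target of this work. There are two causes: (1) data that comes as a text × scene × motion triple and also records task completion barely exists (even the largest real-capture dataset records doors as "holes in the wall that do not open"); (2) judging completion requires physics, but physics is not differentiable, so it cannot be built into ordinary training.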

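How far has the nearest work come (2025–26)?

Work that uses a simulator during training is increasing, but the combination of using task completion as the reward and training the generator itself is still open. The table below lists the five nearest works and how they differ.

Work	What it did	Difference from us
RLPF (2025)	Tunes the generator with reinforcement learning, rewarding "can a robot track this motion?"	The reward is how executable the motion is (a plausibility-type reward). Objects, scenes, and task completion are not considered
SimGenHOI (2025)	Trains the generator and a robot controller alternately. Also reports object-carrying success rates	The success is the score of the generator + controller combination. The controller is also required at use time, and the generator alone does not improve
PhyMotion (2026)	Extracts human motion from generated video, scores it in a simulator, and tunes the video model	Nearly the same mechanism, but every scoring axis asks "is it physically unnatural?". Task completion is not measured
VLK (2026, Amazon)	Synthesizes motion inside scans of real rooms → 48k clips with vision → runs on a real humanoid	The only data quality check is clipping removal, and the generator is fixed once trained. Task completion is used for evaluation only. If anything, a complementary work that serves as the outlet for our improved model
SceMoS (CVPR'26)	Latest and strongest scene-fit motion generation (supervised)	No simulator and no completion check. Our most important baseline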

Method

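Our method: generate → grade with physics → retrain on the good ones only, and repeat

1 · Generate: Specify "waypoints" to the model and have it produce many motions that pass near the door (a native conditioning feature of the diffusion model; for exploration only, not needed in the finished model)

2 · Physics replay: Replay the motion on a human body inside the simulator. Objects (doors etc.) respond according to physical law

3 · Grade and select: Grading has two stages: the effect (did the door open 60° or more?) and a naturalness filter (rejects clipping, falling, and jammed joints). Only motions that pass both are kept

4 · Retrain: Fine-tune the model on the passing motions. Placing the original model as a teacher adds only the new ability without breaking language understanding

5 · Evaluate: Measure the task success rate on placements not used in training, with no waypoints specified

↺ Return to step 1 with the model improved in step 4 (the quality of the harvest also rises with each pass)

Fig 3 · The training loop. The point is that the simulator is a referee at training time only, and the finished model runs on the instruction and the scene alone. The referee labels data without limit, at no cost, and automatically.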

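Why the naturalness filter is needed (anti-cheating)

Because the human body in the simulator can exert unlimited force, knocking the door down with the body also counts as "opened". In our measurements, every "success" of the base model was a cheat of this type (table below). The filter is the safety device that keeps these cheats out of the training data, and it is the lifeline of the method.

Observation	Number	Meaning
Base model "door opened" success	25%	But all by body-ramming (the body clips 15 cm into the door)
… of which also passed naturalness	0%	Evidence that visual naturalness and task completion are different things. This is itself one of the paper's main results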

Why us

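Why we think this can be solved: three grounds (all measured)

Ground 1: The closed loop actually ran (success rate 0% → 75% → 100%)

Success rate: 0% before training · 75% after 1 pass · 100% after 2 passes

Measured on a fixed scene. A door the model could not open at all before training is opened every time after two passes of the loop (185 examples, about 30 minutes of fine-tuning in total). Zero degradation in language understanding (the response to varied prompts stays at 100–102% before and after training).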

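Ground 2: A "factory" that can produce unlimited training data is running

Running waypoint-guided generation on 64 scenes with randomized door position and orientation, 80% were harvested as keepers (205 of 256). What takes days per scene to record with manual motion capture is mass-produced in minutes on one GPU as (scene × text × task-completing motion) triples. This is a device that reverses data scarcity, the field's root problem.

Ground 3: The physics referee detects even errors that geometric checks cannot see

A real example: during development, the harvest came up empty three times in a row. The cause was that the door opens to one side only, and the generated person was pushing from the side that does not open. An error that geometric checks such as clipping or contact cannot detect in principle was exposed at once by the physical effect: the hinge angle stays at 0° no matter how hard the body pushes. A good example of how much information the effect signal carries, and we will also use it as an anecdote in the paper.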

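Honest current status

Generalization to unseen door placements is still weak (8.6% success on 16 placements not used in training: clearly higher than the 0–2% of doing nothing, but short of the 30% target). Three controlled experiments have identified the cause as insufficient training volume for learning to read scene information, and a training run with 6× the data is in progress. This is the main battleground right now.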

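Related work (with links)

Role	Work	One line
Baselines (scene-fit generation)	SceMoS · CVPR'26	Strongest supervised scene fit. The one we pass through our grader to show "the geometry fits but the task is not accomplished"
Baselines (scene-fit generation)	LINGO · SIGGRAPH Asia'24	Instruction → chained generation of in-scene behaviors + dataset
Datasets	TRUMANS · CVPR'24	Largest-scale simultaneous human × scene capture. But objects are fixed and door opening/closing is not recorded
Datasets	ParaHome · CVPR'25	The only one recording "object state changes" such as door angles. But a single room only
Datasets	Nymeria+ · 2026	Largest real-world motion. Doors recorded as "openings that do not open" = core evidence of data scarcity
Datasets	RoboCasa · RSS'24	In-simulator kitchens + task completion checkers. Reference for our checker design
Generator × RL	RLPF · 2025	Reward = trackability. A precedent for training a generator with a sim reward (different reward)
Generator × RL	SimGenHOI · 2025	Mutual training of generator × controller (the controller is the lead)
Generator × RL	PhyMotion · 2026	The same mechanism for video models (reward is naturalness only)
Generator × RL	Morph / CLoSD	Corrects or tracks with physics, but does not return it to the generator as training
Downstream / complementary	VLK · 2026	Synthetic motion → real humanoid. Existence proof of the "outlet" for our improved model
Downstream / complementary	Kimodo · NVIDIA 2026	Our base public motion diffusion model (explicitly stated not to handle scenes)
Components	Eureka-line · 2023–	The line of having an LLM write grading criteria. A component for our automatic task expansion (Phase C)
Components	Sonata · 2025	Pretrained encoder from point cloud → feature vectors. Used to convert the scene input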

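Plan & milestones

Stage	Window	Content	Pass condition & status
Infrastructure	July	Implement the full training loop (body model port, retrainer, exploration)	Done (1 month ahead of schedule)
Closed-loop proof	August	Show on a fixed scene that the success rate rises with every pass	Done: 0%→75%→100% (2 months ahead)
Scene generalization	Now	Add the room's 3D shape as input and succeed on unseen placements too	In progress: currently 8.6%, target 30%. Training with 6× data is running
Scale & credibility	Sept–Oct	Expand to 5–6 task types (drawer, push-and-carry, sitting, etc.) · comparison measuring existing methods with our grader · validate the grader's own reliability (agreement with humans etc.)	4 or more task types + 2 or more comparisons
Finishing	Oct–Nov	User study · analysis experiments · writing	Submit Nov 13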

Glossary

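Mini glossary

Term on this page	Name in the field	Meaning
motion generator	motion diffusion prior	A diffusion model that produces human motion (a time series of skeleton poses) from text and similar inputs
waypoint specification	guidance / constraint	A feature that gives "the root is at this coordinate at this frame" as a condition at generation time. Exploration only; not needed in the finished model
select-and-retrain loop	ReST (Reinforced Self-Training, a rejection-sampling style of self-training)	A self-improvement method that screens its own outputs and retrains only on the passing ones. A framework with a track record in LLMs
placing the original model as teacher	anchor self-distillation	Mixes in a term that keeps the model close to "the original model's answers" during fine-tuning, preventing the destruction of language understanding (catastrophic forgetting)
naturalness filter	plausibility gate	Physics checks that detect clipping, falling, and jammed joints and fail the sample. Prevents cheating (reward hacking)
scene summary vectors	scene tokens (Sonata)	The room's point cloud compressed into 64 vectors by a pretrained encoder and added to the generator's input
task success rate	task success rate	Fraction that both achieve the effect (e.g., door 60° or more) and pass the naturalness filter. Measured on placements and random seeds not used in training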

EGM · CMU × Keio IsogawaLab · implementation: ayan221/effect-grounded-motion (378 unit tests, all features reviewed) · numbers are measurements as of 2026-07-02. Scene-generalization scale experiments are running.

Research plan · updated 2026-07-02 · for readers who know diffusion / RL / motion generation as concepts

Effect-Grounded Motion (EGM)

Build a model that, given "open the door" and the room's 3D shape, generates human motion that actually opens the door. The key idea: replay generated motion inside a physics simulator, and use what changed in the world (the effect) as the training signal for the generator itself. Target: CVPR 2027 (deadline mid-Nov 2026).

Closed loop demonstrated (success 0% → 75% → 100%) · Now: generalizing to unseen scenes

Goal

Motion that works, not motion that merely looks right

Fig 1 · The end state: no robot controller in the loop — the generator alone produces task-completing motion, directly usable for animation, humanoid training data, and VR.

State of the art

What prior work can and cannot do

✓ Solved · Natural motion from text: Human-like motion from prompts like "walk" or "sit" (MDM-line diffusion models). NVIDIA's Kimodo (our base) is the latest public one.

✓ Solved · Geometric scene fit: No wall clipping, feet on the floor, sitting near objects, i.e. shape agreement (SceMoS, LINGO).

✗ Open · Did the task actually succeed? Whether the door opened or the object reached its goal is neither measured nor used as a training signal. Evaluation stops at visual plausibility (FID) or geometric fit.

Fig 2 · The third column is our target. Two root causes: (1) datasets pairing text × scene × motion with recorded task outcomes barely exist (the largest real capture records doors as non-articulating wall openings); (2) success verification needs physics, and physics is non-differentiable, so it does not fit ordinary training.

How close is the 2025–26 wave?

Simulator-in-the-loop training is spreading, but "task-completion reward × improving the generator itself" is still unclaimed. The five nearest works:

Work	What it did	Difference from us
RLPF (2025)	RL-tunes a motion generator with a "can a robot track this?" reward	Reward is executability (plausibility-class); no objects, scenes, or task outcomes
SimGenHOI (2025)	Alternates training a generator and a robot controller; reports carry-success	Success belongs to the generator+controller pair; the controller is required at deployment
PhyMotion (2026)	Scores motion extracted from generated video in a simulator; tunes the video model	Nearly our mechanism, but every scoring axis is naturalness — no task completion
VLK (2026, Amazon)	Synthesizes motion in scanned rooms → 48k vision-paired clips → real humanoid	Quality control is only clipping removal; the generator is frozen; success is used for evaluation only. Complementary — the downstream consumer for our improved generator
SceMoS (CVPR'26)	Strongest supervised scene-fit generator	No simulator, no outcome check. Our primary baseline

Method

Our method: generate → grade with physics → retrain on the winners → repeat

1 · Generate: Give the model waypoints so its samples pass near the door (a built-in conditioning feature of the diffusion model; used for exploration only, not at deployment)

2 · Physics replay: Play each motion on a simulated body; objects (the door) respond by physics

3 · Grade & select: Two checks: effect (did the hinge pass 60°?) and a naturalness filter (no clipping, falling, jammed joints). Keep only motions passing both

4 · Retrain: Fine-tune the generator on the winners, with the original model as a teacher so language understanding is preserved

5 · Evaluate: Measure task success on placements never used in training, with no waypoints

↺ the improved model returns to step 1 (each pass also harvests better data)

Fig 3 · The loop. The simulator is a training-time judge only — the finished model runs from text + scene alone. The judge labels data infinitely, automatically, for free.
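The loop can be sketched as a toy rejection-sampling self-training run. This is a minimal illustration under loud assumptions, not the EGM implementation: the "generator" is a one-parameter Gaussian over push strength, `simulate` is a stand-in for physics replay, and a random contact-depth number stands in for the naturalness filter. Only the 60° effect threshold comes from the text; every function name and every other constant is hypothetical, and the teacher term from step 4 is not modeled.

```python
import random
import statistics

EFFECT_THRESHOLD_DEG = 60.0   # from the text: the hinge must pass 60 degrees
MAX_PENETRATION_M = 0.02      # assumed plausibility limit (hypothetical)


def generate(mean_push, n, rng):
    """Toy generator: each 'motion' is reduced to one push-strength sample."""
    return [rng.gauss(mean_push, 1.0) for _ in range(n)]


def simulate(push, rng):
    """Toy stand-in for physics replay: returns (hinge_deg, penetration_m)."""
    hinge_deg = max(0.0, min(90.0, 30.0 * push))
    penetration_m = abs(rng.gauss(0.0, 0.02))
    return hinge_deg, penetration_m


def is_keeper(hinge_deg, penetration_m):
    """Two-stage grade: effect achieved AND naturalness filter passed."""
    return hinge_deg >= EFFECT_THRESHOLD_DEG and penetration_m <= MAX_PENETRATION_M


def success_rate(mean_push, n, rng):
    """Fraction of fresh samples that pass both checks."""
    samples = generate(mean_push, n, rng)
    return sum(is_keeper(*simulate(s, rng)) for s in samples) / n


def rest_loop(mean_push, n_passes, n_samples, rng):
    """Generate -> grade -> keep winners -> refit the generator on the winners."""
    history = [mean_push]
    for _ in range(n_passes):
        samples = generate(mean_push, n_samples, rng)
        winners = [s for s in samples if is_keeper(*simulate(s, rng))]
        if winners:  # 'retrain': here, simply refit the mean to the winners
            mean_push = statistics.fmean(winners)
        history.append(mean_push)
    return history


rng = random.Random(0)
history = rest_loop(mean_push=0.5, n_passes=2, n_samples=2000, rng=rng)
print(history)
```

In this toy the generator parameter moves toward the region that passes the grader on each pass, which is the same qualitative behavior as the loop described above; the real system replaces the mean refit with fine-tuning of the diffusion model.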

Why the naturalness filter is essential (anti-cheating)

The simulated body can exert unlimited force, so bulldozing the door also counts as "opened". Measured on the base model, every single "success" was this kind of cheat:

Observation	Number	Meaning
Base model "door opened"	25%	All by body-ramming (15 cm of interpenetration)
… of which also physically natural	0%	Looking right and working are different things — itself one of the paper's headline results
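A minimal sketch of the two-stage grade, to make the gap between "effect achieved" and "task success" concrete. The 60° threshold and the 15 cm ramming depth come from the text; the `ReplayResult` fields, the penetration tolerance, and the joint-limit tolerance are assumptions made for illustration, and the four samples are constructed to mirror the table rather than taken from real runs.

```python
from dataclasses import dataclass


@dataclass
class ReplayResult:
    """Hypothetical summary of one physics replay (field names are illustrative)."""
    max_hinge_deg: float       # peak door hinge angle during the replay
    max_penetration_m: float   # deepest body-object interpenetration
    fell: bool                 # whether the body fell over
    joint_limit_hits: int      # frames with a joint jammed at its limit


EFFECT_MIN_HINGE_DEG = 60.0    # from the text
MAX_PENETRATION_M = 0.03       # assumed tolerance, not from the text
MAX_JOINT_LIMIT_HITS = 0       # assumed tolerance, not from the text


def effect_ok(r):
    """Stage 1: did the world change as instructed?"""
    return r.max_hinge_deg >= EFFECT_MIN_HINGE_DEG


def natural_ok(r):
    """Stage 2: plausibility gate that rejects ramming, falls, jammed joints."""
    return (r.max_penetration_m <= MAX_PENETRATION_M
            and not r.fell
            and r.joint_limit_hits <= MAX_JOINT_LIMIT_HITS)


def summarize(results):
    """Report the effect-only rate next to the rate that passes both stages."""
    n = len(results)
    effect = sum(effect_ok(r) for r in results)
    both = sum(effect_ok(r) and natural_ok(r) for r in results)
    return {"effect_rate": effect / n, "task_success_rate": both / n}


# Constructed example: one sample opens the door by ramming, three never touch it.
ramming = ReplayResult(75.0, 0.15, False, 0)
misses = [ReplayResult(0.0, 0.0, False, 0) for _ in range(3)]
report = summarize([ramming] + misses)
print(report)  # {'effect_rate': 0.25, 'task_success_rate': 0.0}
```

Keeping the two rates separate is the design point: training only on samples where both stages pass is what keeps the ramming shortcut out of the retraining data.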

Why us

Why we believe this is solvable — three measured reasons

1 · The loop already closed (success 0% → 75% → 100%)

Success rate: 0% before · 75% after 1 pass · 100% after 2 passes

On a fixed scene: a door the model could never open becomes one it opens every time, after two loop passes (185 examples, ~30 min of fine-tuning). Language understanding unchanged (response diversity to varied prompts stays at 100–102% of the original).

2 · A working factory for unlimited training data

Across 64 randomly placed doors, waypoint-guided generation yields 80% keepers (205 of 256). What takes days per scene with motion capture takes minutes on one GPU — producing (scene × text × task-completing motion) triples and inverting the field's data-scarcity problem.

3 · The physics judge catches errors geometry cannot see

Real incident: three consecutive harvest failures traced to the door being one-way — our scene randomizer presented the non-pushable face to the walker. No clipping/contact check could detect this; the physical effect (hinge pinned at 0° while the body pushed through) exposed it instantly. A concrete demonstration of how much information the effect signal carries.
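The one-way door incident can be reproduced in a toy hinge model. This is a sketch under assumptions: the joint range of 0° to 90°, the gain, and the function name are illustrative and do not come from the project's simulator setup.

```python
def replay_hinge(push_torques, lower_deg=0.0, upper_deg=90.0, gain_deg=5.0):
    """Integrate a toy hinge: the angle follows the torque, clamped to its joint range."""
    angle = 0.0
    for torque in push_torques:
        angle = min(upper_deg, max(lower_deg, angle + gain_deg * torque))
    return angle


pushes = [1.0] * 20                              # 20 steps of steady pushing
right_side = replay_hinge(pushes)                # pushing in the opening direction
wrong_side = replay_hinge([-t for t in pushes])  # the same push from the other face

print(right_side, wrong_side)  # 90.0 0.0
```

Both replays involve the same amount of contact, so a contact or clipping check sees them as alike; only the hinge angle separates them.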

Honest current status

Generalization to unseen door placements is still weak: 8.6% success on 16 never-trained placements — clearly above the no-scene baseline (0–2%) but below our 30% bar. Three controlled probes isolated the bottleneck to training volume for the scene-reading pathway (not the encoder); a 6× larger training round is running now. This is the current battleground.

Related work (with links)

Role	Work	One line
Baselines (scene-fit generation)	SceMoS · CVPR'26	Strongest supervised scene-fit; the one we run through our grader to show "fits but doesn't function"
Baselines (scene-fit generation)	LINGO · SIGGRAPH Asia'24	Instruction → chained in-scene behaviors + dataset
Datasets	TRUMANS · CVPR'24	Largest human-scene capture; objects rigid, no door articulation
Datasets	ParaHome · CVPR'25	The only one recording object state change (door angles); single room
Datasets	Nymeria+ · 2026	Largest in-the-wild motion; doors recorded as non-articulating openings = the scarcity argument
Datasets	RoboCasa · RSS'24	Simulated kitchens + task success checkers; reference for our detectors
Generator × RL	RLPF · 2025	Reward = trackability; precedent for sim-reward tuning (different reward)
Generator × RL	SimGenHOI · 2025	Generator × controller co-training (controller-centric)
Generator × RL	PhyMotion · 2026	Same mechanism for video models (naturalness rewards only)
Generator × RL	Morph / CLoSD	Physics-corrects or tracks, but never feeds back into the generator
Downstream / complementary	VLK · 2026	Synthetic motion → real humanoid; proof of the outlet our improved generator plugs into
Downstream / complementary	Kimodo · NVIDIA 2026	Our base public motion diffusion model (explicitly does not model scenes)
Components	Eureka-line · 2023–	LLM-written grading criteria; our task-scaling component (Phase C)
Components	Sonata · 2025	Pretrained point-cloud encoder used for the scene input

Also surveyed: SimDiff (2509.20927), TEXEDO (2606.22998), REFINE-DP (2603.13707), EasyTune (2602.07967) — 20+ works total; full notes in the Obsidian vault.

Plan

Plan & milestones

Stage	Window	Content	Bar & status
Infrastructure	July	Full training loop (body model port, retrainer, exploration)	DONE (1 month early)
Closed-loop proof	Aug	Show success rises every pass on a fixed scene	DONE — 0%→75%→100% (2 months early)
Scene generalization	now	Add the room's 3D shape as input; succeed at unseen placements	in progress — 8.6% now, bar 30%; 6× data round running
Scale & credibility	Sept–Oct	5–6 task types (drawer, push, sit …) · run prior methods through our grader · validate the grader itself (human agreement etc.)	≥4 task types + ≥2 baselines
Package	Oct–Nov	User study · analyses · writing	submit Nov 13

Glossary

Mini glossary

Plain term on this page	Field term	Meaning
motion generator	motion diffusion prior	A diffusion model producing human motion (skeleton trajectories) from text etc.
waypoints	guidance / constraints	Conditioning the generation on "the body passes here at this time" — exploration only, absent at deployment
select-and-retrain loop	ReST (Reinforced Self-Training, a rejection-sampling style of self-training)	Grade your own outputs, retrain on the passing ones; proven framework from LLM training
original model as teacher	anchor self-distillation	A loss term keeping the fine-tuned model close to the original's answers, preventing catastrophic forgetting of language
naturalness filter	plausibility gate	Physics checks (interpenetration, falls, jammed joints) that reject cheats (reward hacking)
scene summary vectors	scene tokens (Sonata)	The room's point cloud compressed to 64 vectors by a pretrained encoder, appended to the generator's input
task success rate	—	Fraction that both achieve the effect (door ≥ 60°) and pass the naturalness filter, on held-out placements and random seeds
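For the "original model as teacher" entry, one possible shape of the loss is a task term on the winning samples plus a weighted pull toward the frozen original model's output. The mean-squared-error form, the weight, and the function names below are assumptions for illustration; the glossary only states that a term keeps the fine-tuned model close to the original's answers.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anchored_loss(student_pred, target, teacher_pred, anchor_weight=0.5):
    """Task loss on a winner sample plus a pull toward the frozen original model."""
    task_term = mse(student_pred, target)          # learn the new ability
    anchor_term = mse(student_pred, teacher_pred)  # stay close to the original
    return task_term + anchor_weight * anchor_term


# The student matches the target exactly but has drifted from the teacher.
loss = anchored_loss([1.0, 2.0], [1.0, 2.0], [0.0, 2.0])
print(loss)  # 0.25
```

In practice the anchor term would likely be evaluated on a broad set of prompts rather than only on the door samples, since its purpose is to preserve behavior the winners do not cover.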

EGM · CMU × Keio IsogawaLab · code: ayan221/effect-grounded-motion (378 unit tests, all features reviewed) · numbers measured as of 2026-07-02; scene-generalization scale-up in flight.
