GPU cluster status
colligo-laser02-prod-uw2 · Adobe Pluto · live capacity snapshot
Captured: 2026-06-05
🌐 Cluster-wide free GPUs
| Instance type | GPU | Nodes | Total GPUs | Used | Free | Utilization |
|---|---|---|---|---|---|---|
| p4d.24xlarge | A100-40GB | 219 | 1752 | 1242 | 510 | 71% |
| p5.48xlarge | H100 (80GB) | 510 | 4080 | 3034 | 1046 | 74% |
| p5en.48xlarge | H200 (141GB) | 598 | 4784 | 4771 | 13 | 100% |
| p4de.24xlarge | A100-80GB | 1009 | 8072 | 8029 | 43 | 99% |
| p6-b200.48xlarge | B200 (180GB) | 29 | 232 | 230 | 2 | 99% |
| g5.12xlarge | A10G | 3 | 12 | 0 | 12 | 0% |
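
The Free and Utilization columns follow from Total and Used. The sketch below recomputes them from the rows above and sorts GPU types by free count. It assumes utilization is used divided by total, rounded to the nearest percent, which matches the table but is not stated by the snapshot. The names `ROWS`, `summarize`, and `SUMMARY` are illustrative only.

```python
# Rows copied from the table above: (instance type, GPU, total GPUs, used GPUs).
ROWS = [
    ("p4d.24xlarge", "A100-40GB", 1752, 1242),
    ("p5.48xlarge", "H100 (80GB)", 4080, 3034),
    ("p5en.48xlarge", "H200 (141GB)", 4784, 4771),
    ("p4de.24xlarge", "A100-80GB", 8072, 8029),
    ("p6-b200.48xlarge", "B200 (180GB)", 232, 230),
    ("g5.12xlarge", "A10G", 12, 0),
]


def summarize(rows):
    """Return (gpu, free, utilization_percent) tuples, most free GPUs first."""
    out = []
    for _instance, gpu, total, used in rows:
        free = total - used
        # Assumption: utilization is used/total rounded to the nearest percent.
        util = round(100 * used / total)
        out.append((gpu, free, util))
    return sorted(out, key=lambda row: row[1], reverse=True)


SUMMARY = summarize(ROWS)
for gpu, free, util in SUMMARY:
    print(f"{gpu:<14} free={free:>5} util={util}%")
```

Sorted this way, H100 and A100-40GB come first, which is the ordering the recommendations rely on.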
🎯 Your quota (schedulable capacity)
| Resource pool | GPU | Alloc | Used | Avail | Preempt |
|---|---|---|---|---|---|
| GAI-415 Foundry | H200 | 1144 | 1095 | 47 | 0 |
| GAI-415 Foundry | B200 | 240 | 227 | 13 | 0 |
| GAI-415 Foundry | H100 | 0 | 0 | 0 | 130 |
| GAI-415 Foundry | A100-80GB | 0 | 0 | 0 | 4 |
| Foundry-THD (project) | H200 | 0 | 18 | -18 ⚠️ | 0 |
| Foundry-THD (project) | B200 | 0 | 1 | -1 ⚠️ | 0 |
| Foundry-THD (project) | H100 | 0 | 0 | 0 | 128 |
| GAI-416 Platform Opt | H100 | 120 | 28 | 92 | 0 |
| xPU Efficiency (project) | H100 | 120 | 28 | 92 | 0 |
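
Reading the table, Avail appears to be Alloc minus Used, with a negative value marking a pool that has used more than its allocation. That relationship is an assumption, not something the snapshot states. The sketch below lists the over-allocated pools and any row where the reported Avail differs from Alloc minus Used. The names `QUOTA`, `over_allocated`, and `mismatches` are illustrative only.

```python
# Rows copied from the quota table:
# (pool, GPU, alloc, used, reported avail, preemptible).
QUOTA = [
    ("GAI-415 Foundry", "H200", 1144, 1095, 47, 0),
    ("GAI-415 Foundry", "B200", 240, 227, 13, 0),
    ("GAI-415 Foundry", "H100", 0, 0, 0, 130),
    ("GAI-415 Foundry", "A100-80GB", 0, 0, 0, 4),
    ("Foundry-THD (project)", "H200", 0, 18, -18, 0),
    ("Foundry-THD (project)", "B200", 0, 1, -1, 0),
    ("Foundry-THD (project)", "H100", 0, 0, 0, 128),
    ("GAI-416 Platform Opt", "H100", 120, 28, 92, 0),
    ("xPU Efficiency (project)", "H100", 120, 28, 92, 0),
]


def over_allocated(rows):
    """Pools whose usage exceeds their allocation, with the excess GPU count."""
    return [
        (pool, gpu, used - alloc)
        for pool, gpu, alloc, used, _avail, _preempt in rows
        if used > alloc
    ]


def mismatches(rows):
    """Rows where reported Avail differs from Alloc - Used (assumed formula)."""
    return [
        (pool, gpu, alloc - used, avail)
        for pool, gpu, alloc, used, avail, _preempt in rows
        if alloc - used != avail
    ]


print(over_allocated(QUOTA))
print(mismatches(QUOTA))
```

Under this assumption, one row stands out: GAI-415 Foundry H200 reports 47 available where Alloc minus Used gives 49, so the reported figure may account for something not shown in the table.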
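💡 Recommendations
- Recommended: if you need many GPUs or need to start right away, go for H100 (1046 free cluster-wide; you have 130 preemptible GPUs in GAI-415 Foundry and 128 in Foundry-THD, the safest option) or A100-40GB (510 free).
- Full: H200, A100-80GB, and B200 are at 99 to 100% utilization. Only 47 H200 remain in guaranteed quota (GAI-415 Foundry), and the Foundry-THD project is already over its allocation (H200 by 18, B200 by 1), so either wait or run preemptible.
- Command: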
python3 manage.py create --gpu-type H100 --gpu-count N --preemptible --start
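
To fill in N, a small helper can assemble the command line shown above. This is a sketch only: it uses just the flags that appear in the command, does not run anything, and the function name `build_create_command` and the count 64 are hypothetical.

```python
import shlex


def build_create_command(gpu_type: str, gpu_count: int) -> str:
    """Assemble the create command from the snapshot with a concrete GPU count."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be a positive integer")
    argv = [
        "python3", "manage.py", "create",
        "--gpu-type", gpu_type,
        "--gpu-count", str(gpu_count),
        "--preemptible",
        "--start",
    ]
    # shlex.join quotes arguments where needed so the result is safe to paste.
    return shlex.join(argv)


# Hypothetical request: 64 preemptible H100 GPUs.
print(build_create_command("H100", 64))
```

Keeping the count below the preemptible figures in the quota table (128 to 130 for H100) is a reasonable starting point, though whether the scheduler enforces that limit is not stated here.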