Fig. 1: Supporting multiple training frameworks with ONNX.
Fig. 2a: Example of how the Orchestrator Segmenter compiles relevant parts of the graph to TensorRT.
Fig. 2b: Example of incorporating multiple sub-compiler passes to produce a final stitched binary.
Fig. 3: Relative improvements in onboard resource utilization after general adoption of FTL.
Fig. 4: Injecting a PyTorch GPU kernel into the final compiled graph.
Fig. 5: Example of issues brought about by the model export / conversion process.
Fig. 6: Using the FTL Segment Breaker in the exported graph to isolate a subgraph and configure it to compile in FP32.
Fig. 7: How the user can configure multi-GPU inference in their model.
Fig. 8: Nuro perception detector latency over time. Note the ~27% drop in latency after multi-GPU inference is enabled.