
Model Conversion Times for Edge AI Deployment

Benchmarking PyTorch YOLOv8, ONNX, TensorRT FP32/FP16/INT8, and TorchScript runtimes with accuracy, latency, and throughput measurements.

Optimizing an inspection model is a lot like selecting the right tool on a production line: the model may be the same, but its container (runtime) decides how quickly it reacts. To make that choice simpler, we converted our YOLOv8-based quality-control model into ONNX, TensorRT (multiple precisions), and TorchScript, then compared accuracy and speed side by side. Each runtime was tested with the same validation set. We also rechecked precision and recall after every conversion so the accuracy comparison stayed fair.
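
For context, this conversion-and-recheck loop can be scripted with the Ultralytics export and validation API. The sketch below is illustrative rather than our exact pipeline: the weight file `qc_yolov8.pt` and dataset file `qc.yaml` are placeholder names, and the arguments shown are the standard Ultralytics export options for each target format.

```python
from ultralytics import YOLO

# Load the trained quality-control weights (placeholder file name).
model = YOLO("qc_yolov8.pt")

# Export to the runtimes compared below. Each TensorRT export writes an
# .engine file with the same stem, so in practice each one is renamed
# before the next export runs.
model.export(format="onnx")                               # ONNX, static input
model.export(format="engine", dynamic=True)               # TensorRT FP32, dynamic profile
model.export(format="engine", half=True)                  # TensorRT FP16
model.export(format="engine", int8=True, data="qc.yaml")  # TensorRT INT8 (calibration data)
model.export(format="torchscript")                        # TorchScript

# Re-validate every exported artifact on the same validation set so the
# precision/recall/mAP comparison against the PyTorch baseline stays fair.
for weights in ["qc_yolov8.pt", "qc_yolov8.onnx",
                "qc_yolov8.engine", "qc_yolov8.torchscript"]:
    metrics = YOLO(weights).val(data="qc.yaml")
    print(weights, metrics.box.mp, metrics.box.mr, metrics.box.map50, metrics.box.map)
```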

Accuracy Snapshot

| Runtime | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| PyTorch YOLOv8 FP32 | 0.933 | 0.882 | 0.916 | 0.517 |
| ONNX (static input) | 0.934 | 0.882 | 0.918 | 0.517 |
| TensorRT FP32 (dynamic profile) | 0.934 | 0.882 | 0.918 | 0.517 |
| TensorRT FP16 | 0.934 | 0.882 | 0.919 | 0.516 |
| TensorRT INT8 | 0.919 | 0.882 | 0.931 | 0.509 |
| TorchScript | 0.934 | 0.882 | 0.918 | 0.517 |

The chart below shows how little the accuracy moved even after the most aggressive conversion. In short: everything stayed within striking distance of the PyTorch baseline, and the INT8 version actually gave us a tiny boost on mAP50.

Bar chart showing mAP50-95 results for YOLO FP32, ONNX, TensorRT FP32/FP16/INT8, and TorchScript.

Speed & Throughput

The table and charts below summarize the stopwatch data. Latency values are in milliseconds per prediction, and the final column reports the total time in seconds to process 250 frames in the 1x1 configuration.

| Runtime | 1x1 Latency (ms) | 1x2 Latency (ms) | 2x1 Latency (ms) | 250 Frames @1x1 (s) |
|---|---|---|---|---|
| PyTorch YOLOv8 FP32 | 20.0 | 21.0 | 38.0 | 5.00 |
| ONNX (static input) | 34.6 | 33.6 | 67.0 | 8.64 |
| TensorRT FP32 (dynamic) | 17.8 | 20.2 | 35.6 | 4.44 |
| TensorRT FP32 (ND batch-1) | 21.6 | — | 43.1 | 5.39 |
| TensorRT FP16 | 11.0 | 10.1 | 18.7 | 2.75 |
| TensorRT INT8 | 9.2 | 8.5 | 15.4 | 2.29 |
| TorchScript | 24.2 | 24.9 | 49.0 | 6.05 |
  • The best TensorRT FP32 setup lowered single-stream latency to 17.8 ms, a quick win without touching precision.
  • FP16 (the “half” version) almost halved latency to 11 ms and needed no retraining.
  • INT8 required a calibration pass but ran at 9.2 ms and finished 250 frames in 2.29 s.
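
For reference, per-frame latency and the 250-frame totals can be measured with a simple loop like the one below: a few warm-up inferences first, then timed predictions. This is a minimal sketch using the Ultralytics predict API, not our exact benchmark harness; the weight paths and frame list are placeholders.

```python
import time
from ultralytics import YOLO

def benchmark(weights: str, frames, warmup: int = 10) -> None:
    """Report average per-frame latency and total time for a list of frames."""
    model = YOLO(weights)

    # Warm-up so CUDA context creation and lazy allocations are not timed.
    for frame in frames[:warmup]:
        model.predict(frame, verbose=False)

    start = time.perf_counter()
    for frame in frames:
        model.predict(frame, verbose=False)
    total = time.perf_counter() - start

    print(f"{weights}: {1000 * total / len(frames):.1f} ms/frame, "
          f"{total:.2f} s for {len(frames)} frames")

# Example usage with a placeholder list of 250 validation images:
# benchmark("qc_yolov8.engine", frames=list_of_250_image_paths)
```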

Bar chart showing 1x1 latency for each runtime option.

To visualize the trade-off between raw speed and accuracy, we plotted each runtime on the chart below.

Scatter chart showing latency on the x-axis and mAP50-95 on the y-axis for each runtime.

TensorRT Profile Tuning

To squeeze out a bit more speed we exported two FP32 TensorRT engines:

  1. ND batch-1 profile (nd_b1): tuned strictly for batch-1 inputs.
  2. Dynamic profile for batch-1 / batch-2 (d_b1 / nd_b2): handles small bursts (batch-2) while staying fast on batch-1.

The dynamic profile responded faster when we spun up a second stream (20.2 ms vs 43.1 ms) and still beat the PyTorch baseline. That tuning step alone recovered roughly one millisecond per frame without sacrificing accuracy.
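
As a rough illustration of what the dynamic profile looks like, the sketch below builds an FP32 engine from the ONNX export with the TensorRT Python API, accepting batch 1 through batch 2 while optimizing for batch 1. The ONNX path, the input tensor name `images`, and the 640x640 input size are assumptions for illustration; trtexec or the Ultralytics exporter can produce an equivalent engine.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network (TensorRT 8.x style creation flag).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("qc_yolov8.onnx", "rb") as f:        # placeholder ONNX export
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()

# Dynamic profile: tuned for batch 1, but accepting batch 2 so a short
# burst of two frames does not need a second engine.
profile = builder.create_optimization_profile()
profile.set_shape("images",                    # assumed input tensor name
                  (1, 3, 640, 640),            # min shape
                  (1, 3, 640, 640),            # opt shape (what we tune for)
                  (2, 3, 640, 640))            # max shape
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open("qc_yolov8_fp32_dynamic.engine", "wb") as f:
    f.write(serialized)
```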

Precision-Driven Speedups

  • FP16 (half): Halved inference time to 11 ms with a single command. No retraining, just export-and-go.
  • INT8 (int): Adds a short calibration step but rewards it with 9.2 ms single-stream latency and 3.84 s for 250 frames when running two pipelines.
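
The calibration step works by letting TensorRT observe activation ranges on a few hundred representative frames. Below is a minimal sketch of an entropy calibrator, assuming preprocessed 1x3x640x640 float32 batches and using pycuda for the device buffer; batch loading and preprocessing are omitted, and the exact calibrator we used may differ.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed calibration batches to the TensorRT builder."""

    def __init__(self, batches, cache_file="int8_calib.cache"):
        super().__init__()
        self.batches = iter(batches)       # iterable of (1, 3, 640, 640) float32 arrays
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(1 * 3 * 640 * 640 * np.float32().nbytes)

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                    # no more data: calibration is finished
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Attached to the builder config before building the INT8 engine:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_batches)
```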

Takeaways

  • ONNX is a useful interchange format, but treat it as a temporary stop; it matched accuracy yet ran 70% slower than PyTorch.
  • TensorRT FP32 with a dynamic profile is the easiest speed boost: no accuracy loss and a measurable latency cut.
  • FP16 strikes the best balance for most factories, nearly halving latency with virtually identical detection quality.
  • INT8 is ideal for high-speed lines where every millisecond counts; it cleared our 250-frame sample more than twice as fast as the baseline.
