Adrian Lizarraga authored and GitHub committed b84712151c0 on 17 Feb 2024
QNN EP: Fuse DQ -> Q sequences into a QNN Convert op (#19511)

### Description
Fuses a DQ -> Q sequence into a single QNN Convert operator if:
- The sequence converts from one quantization type (qtype) to another.
Ex: Dequantize(uint8 to float) -> Quantize(float to uint16)
- The DQ and Q operators are not part of another node unit (i.e., they
are standalone).
- The Q operator is the only consumer of the DQ operator's output.
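
The three conditions above can be sketched as a small predicate over a toy graph representation. This is only an illustration, not the QNN EP implementation: the `Node` class, its fields, the `consumers` map, and the `can_fuse_dq_q` name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified graph node (not an ONNX Runtime type)."""
    name: str
    op_type: str            # e.g. "DequantizeLinear" or "QuantizeLinear"
    input_name: str         # name of the (single) data input tensor
    output_name: str        # name of the (single) output tensor
    qtype: str              # quantized type: DQ input type or Q output type
    in_node_unit: bool = False  # True if already grouped into another node unit


def can_fuse_dq_q(dq, q, consumers):
    """Return True if the DQ -> Q pair satisfies the fusion conditions.

    `consumers` maps a tensor name to the list of node names that read it
    (an assumed helper structure for this sketch).
    """
    if dq.op_type != "DequantizeLinear" or q.op_type != "QuantizeLinear":
        return False
    # The Q node must read the DQ node's output directly.
    if q.input_name != dq.output_name:
        return False
    # Condition 1: the pair converts between two different quantized types.
    if dq.qtype == q.qtype:
        return False
    # Condition 2: both nodes are standalone (not part of another node unit).
    if dq.in_node_unit or q.in_node_unit:
        return False
    # Condition 3: Q is the only consumer of the DQ output.
    return consumers.get(dq.output_name, []) == [q.name]
```

For the Add example in this description, a `uint8` DQ feeding a single `uint16` Q would pass the check, while a DQ whose output is also read by a second node would not.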



### Motivation and Context
Allows faster execution of QDQ models with mixed activation types by
leveraging the QNN Convert operator, which converts between quantization
types. For certain models, this results in inference latency speed-ups
of up to 2x, depending on the number of DQ -> Q sequences.
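
To illustrate what "converting between quantization types" means numerically, here is a minimal sketch of the per-tensor affine dequantize and quantize formulas, composed into one function. The function names and the scale/zero-point values are assumptions chosen for the example; how QNN computes Convert internally is not described here.

```python
def dequantize(q, scale, zero_point):
    """Per-tensor affine dequantization: integer -> real value."""
    return (q - zero_point) * scale


def quantize(x, scale, zero_point, qmin, qmax):
    """Per-tensor affine quantization with saturation to [qmin, qmax].

    Python's round() rounds halves to even, matching the rounding mode
    specified for ONNX QuantizeLinear.
    """
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def convert(q, in_scale, in_zp, out_scale, out_zp, qmin, qmax):
    """One-step equivalent of DQ followed by Q (hypothetical helper)."""
    return quantize(dequantize(q, in_scale, in_zp), out_scale, out_zp, qmin, qmax)


# uint8 value 200 (scale 0.5, zero point 128) represents 36.0, which maps
# to 33056 in uint16 with scale 0.125 and zero point 32768.
print(convert(200, 0.5, 128, 0.125, 32768, 0, 65535))  # prints 33056
```

The fused operator has to produce the same quantized result as the two-node sequence, but it avoids materializing the intermediate float tensor.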

#### Example for Add node unit with 16-bit I/O:

Original:
```
u8 ----> DQ ---> Q ---u16--> Add ---u16-->
                              ^
                              |
u16 --------------------------+
```

After fusing DQ -> Q:
```
u8 ----> Convert ---u16--> Add ---u16-->
                            ^
                            |
u16 ------------------------+
```