Commit cdc5d72ba9d, authored by Adrian Lizarraga and committed via GitHub.
## [QDQ Quant] Support mixed-precision integer quantization via overrides (#19925)

### Description
Adds support for specifying mixed-precision QDQ models via tensor quantization overrides.

### Motivation and Context
This PR implements an approach for supporting "mixed precision" models. The following figure demonstrates an example mixed-precision model as defined in this PR.

*(figure: example mixed-precision QDQ model)*

A mixed-precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ-to-Q sequence. The ability to specify regions with different quantization data types enables exploring the trade-offs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can help achieve a desirable balance in key metrics.

#### Current support
By default, the ORT quantizer supports specifying default activation and weight quantization data types for the entire model. A recent PR added support for specifying basic quantization overrides at the tensor level via the `extra_options["TensorQuantOverrides"]` configuration:

```
TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
    list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
    per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
        'quant_type' = QuantType : The tensor's quantization data type.
        'scale' = Float          : The scale value to use. Must also specify `zero_point` if set.
        'zero_point' = Int       : The zero-point value to use. Must also specify `scale` if set.
        'symmetric' = Bool       : If the tensor should use symmetric quantization. Invalid if
                                   `scale` or `zero_point` is also set.
        'reduce_range' = Bool    : If the quantization range should be reduced. Invalid if
                                   `scale` or `zero_point` is also set.
        'rmax' = Float           : Override the maximum real tensor value in calibration data.
                                   Invalid if `scale` or `zero_point` is also set.
        'rmin' = Float           : Override the minimum real tensor value in calibration data.
                                   Invalid if `scale` or `zero_point` is also set.
```
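As a concrete illustration, here is a minimal sketch of passing such tensor-level overrides to `quantize_static`. The model paths, the tensor name `conv1_weight`, the input name `input`, and the toy calibration reader are all hypothetical; only `extra_options["TensorQuantOverrides"]` and the override keys come from the documented configuration above.

```python3
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class ToyDataReader(CalibrationDataReader):
    """Feeds a single random sample; a real reader would yield actual calibration data."""

    def __init__(self):
        # Hypothetical input name and shape, for illustration only.
        self._samples = iter([{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}])

    def get_next(self):
        return next(self._samples, None)


quantize_static(
    model_input="model.onnx",          # hypothetical paths
    model_output="model.qdq.onnx",
    calibration_data_reader=ToyDataReader(),
    activation_type=QuantType.QUInt8,  # default data types for the whole model
    weight_type=QuantType.QUInt8,
    extra_options={
        "TensorQuantOverrides": {
            # Per-tensor override: quantize this weight symmetrically as int8.
            "conv1_weight": [{"quant_type": QuantType.QInt8, "symmetric": True}],
        }
    },
)
```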
The tensor-level overrides are currently used to override the quantization type for weights/initializers or to set specific scale/zero-point values for a tensor (e.g., QNN requires Sigmoid to use a specific scale/zero-point at its output). However, these overrides are not typically used to override activation quantization types, due in large part to operator data type constraints. Consider, for example, that all inputs and outputs of an Add operator must have the same data type. Consequently, using tensor-level overrides to promote the Add's output to 16 bits would force the inputs to also be overridden to 16-bit. In turn, this would have a cascading effect on potentially the entire graph. The solution implemented by this PR is to allow the specification of tensor boundaries where the activation quantization data type changes.

#### The approach
The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type.

*(figure: model with a 16-bit region promoted from the default 8-bit activation type)*

Note the following observations:

- Op2's output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, while Op7 and Op8 consume the original u8 type.
- Op3's output is converted from u8 to u16. Op5 consumes the converted u16 type.
- Op4's output is just u16 (not converted).
- Op5's output is converted from u16 to u8. Op6 consumes the u8 type.

The approach implemented by this PR uses the tensor-level quantization overrides to specify a tensor's quantization type at both the producer and consumer ends. **The following shows the overrides necessary to create this mixed-precision QDQ model.**

```python3
overrides = {
    "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": QUInt16}],
    "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}],
}
```
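Putting it together, passing these overrides through `extra_options` might look like the following sketch. The model paths and data reader are hypothetical (reusing `ToyDataReader` from the earlier sketch), and the `Op*` tensor/node names mirror the figure above; the model-wide default activation type stays 8-bit while the overrides promote the Op4 region to 16-bit.

```python3
from onnxruntime.quantization import QuantType, quantize_static

# Each entry gives the producer-side type plus a "convert" entry naming the
# consumers ("recv_nodes") that should receive the converted type; any other
# consumers keep the original type.
mixed_precision_overrides = {
    "Op2_out": [{"quant_type": QuantType.QUInt8,
                 "convert": {"quant_type": QuantType.QUInt16, "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": QuantType.QUInt8,
                 "convert": {"quant_type": QuantType.QUInt16, "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": QuantType.QUInt16}],
    "Op5_out": [{"quant_type": QuantType.QUInt16,
                 "convert": {"quant_type": QuantType.QUInt8, "recv_nodes": {"Op6"}}}],
}

quantize_static(
    "model.onnx",                      # hypothetical paths and reader, as above
    "model.mixed.qdq.onnx",
    ToyDataReader(),
    activation_type=QuantType.QUInt8,  # the model-wide default stays 8-bit
    weight_type=QuantType.QUInt8,
    extra_options={"TensorQuantOverrides": mixed_precision_overrides},
)
```

Note that only the boundary tensors need `convert` entries; a tensor wholly inside the promoted region (like `Op4_out`) simply takes the higher-precision `quant_type` directly.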