pengwa authored and GitHub committed 0471f6fbb3a
Check type for building gradient graph (#17046)

### Check type for building gradient graph

**Bug1**: Fix the error when running the model with ORTModule + Stage 3:

```
Exception happens when running <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
    wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients
```

This happens because when running PythonOpA, its 3rd input is int64; during the check in the gradient builder we find it requires a gradient, so we set its `requires_grad = True`, but PyTorch considers that invalid and throws the exception above. So we need to understand why the ORT gradient builder thinks the 3rd input needs a gradient.

`ReverseBFSWithStopGradient` does a reverse BFS from the graph outputs, collecting all nodes needed to compute them. It defines a queue, initially containing all nodes that produce graph outputs, then iterates over the queued nodes one by one, checking each node's inputs: if an input does not hit a stop edge and its node arg type is an allowed type (float, etc.), the input's producer node is appended to the queue and the traversal continues. PythonOpA is such a node needed to compute the graph outputs, so `IsReachable(PythonOpA)` returns `True`.

During that traversal, when the node is PythonOpB and `next_node` is PythonOpA, we did not check the node arg type between node and next_node on the connection of PythonOpA's 3rd input to PythonOpB's outputs. As a result, the int64-typed node args were appended to the sets that require gradients.

**Fix1**: add the node arg type check before appending the node arg to the requires-grad lists.
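The traversal and Fix1 can be sketched in Python. This is a hypothetical illustration, not the actual ORT C++ implementation: the function name mirrors `ReverseBFSWithStopGradient`, but the graph representation (`producer_of`, `inputs_of`, `dtype_of`, `stop_edges`) and the type whitelist are invented for this sketch.

```python
# Hypothetical sketch of ReverseBFSWithStopGradient with Fix1: before enqueuing
# an input's producer node, check that the input's element type can actually
# require a gradient (skip int64 etc.). All names here are illustrative.
from collections import deque

# Assumed whitelist of differentiable element types for this sketch.
DIFFERENTIABLE_TYPES = {"float", "float16", "double", "bfloat16"}

def reverse_bfs_with_stop_gradient(graph_outputs, producer_of, inputs_of,
                                   dtype_of, stop_edges):
    """Collect nodes reachable backwards from the graph outputs.

    graph_outputs: node-arg names that are graph outputs
    producer_of:   node-arg name -> name of the node producing it
    inputs_of:     node name -> list of its input node-arg names
    dtype_of:      node-arg name -> element type string
    stop_edges:    set of (node, input_arg) pairs where traversal stops
    """
    queue = deque(producer_of[o] for o in graph_outputs if o in producer_of)
    reachable = set(queue)
    while queue:
        node = queue.popleft()
        for arg in inputs_of.get(node, []):
            if (node, arg) in stop_edges:
                continue
            # Fix1: only follow edges whose node arg type allows gradients;
            # an int64 edge no longer pulls its producer into the grad graph.
            if dtype_of.get(arg) not in DIFFERENTIABLE_TYPES:
                continue
            producer = producer_of.get(arg)
            if producer is not None and producer not in reachable:
                reachable.add(producer)
                queue.append(producer)
    return reachable
```

With this check, a producer like PythonOpA that is connected to PythonOpB only through an int64 node arg is never marked reachable, so no gradient is requested for that edge.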
After Fix1, a unit test failed: `orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min] Fatal Python error: Segmentation fault`. Investigation showed this is another bug.

**Bug2**: Without Fix1, the execution graph builds a gradient edge for the int64-typed node arg even though that edge has no consumers, and execution happens to run fine. On reflection, though, an int-typed node arg should not have a gradient edge built at all. With Fix1, the execution graph no longer builds a gradient edge for the int-typed node arg, so **Fix1** addresses that problem.

But another bug surfaces when the initial `y_node_arg_names` contain mixed types, e.g. in this case ATen's two outputs: the 1st is float, the 2nd is int. When we check the y_node (https://github.com/microsoft/onnxruntime/blob/6e6f582e0875f15837b338c402b065fd95e0dd46/orttraining/orttraining/core/framework/gradient_graph_builder.cc#L60C16-L60C16), we did not check the data type before adding it into `y_node_args_`, the list of graph output node args that require gradients. Consequently, `non_differentiable_y_node_arg_names_` does not contain the int-typed graph output. Then https://github.com/microsoft/onnxruntime/blob/6e6f582e0875f15837b338c402b065fd95e0dd46/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc#L312C18-L312C18 tries to get the grad node arg and append it to `yield_output_node_args`, but with **Fix1** applied, no grad node arg was built for the int-typed node arg. So we insert a nullptr, and later, when we use it, we get a segmentation fault.

**Fix2**: add the type check when handling `y_node_args`, and add a null check when getting the gradient node arg to append into `yield_output_node_args`.
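Fix2 can also be sketched in Python. Again this is a hypothetical illustration of the two guards, not the actual C++ code in `ortmodule_graph_builder.cc`: the function name, the dictionaries, and the type whitelist are all invented for the sketch.

```python
# Hypothetical sketch of Fix2: when collecting gradient node args for the
# graph outputs, (a) skip outputs whose type cannot be differentiated, and
# (b) skip outputs whose gradient node arg was never built, instead of
# appending a null entry that crashes later. Names are illustrative.

# Assumed whitelist of differentiable element types for this sketch.
DIFFERENTIABLE_TYPES = {"float", "float16", "double", "bfloat16"}

def collect_yield_output_grad_args(y_node_arg_names, dtype_of, grad_arg_of):
    """Return gradient node args for the differentiable graph outputs only.

    y_node_arg_names: requested graph-output node-arg names
    dtype_of:         node-arg name -> element type string
    grad_arg_of:      node-arg name -> its gradient node-arg name, if built
    """
    yield_output_node_args = []
    for name in y_node_arg_names:
        # Fix2, type check: an int-typed output never gets a gradient edge,
        # so it must not be treated as requiring a gradient.
        if dtype_of.get(name) not in DIFFERENTIABLE_TYPES:
            continue
        grad = grad_arg_of.get(name)
        # Fix2, null check: if the gradient node arg was not built,
        # skip it rather than inserting a null entry (the old nullptr bug).
        if grad is None:
            continue
        yield_output_node_args.append(grad)
    return yield_output_node_args
```

For an ATen-style node with a float output (gradient built) and an int output (no gradient built), only the float output's gradient node arg is collected, which avoids the nullptr that caused the segmentation fault.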