Public / onnxruntime / c52636e187e

Dmitri Smirnov authored and GitHub committed c52636e187e on 06 Dec 2018

Implement Tokenizer op (#31)

* Implement separator tokenizer with TST.
  TODO: Clarify what to do if the output is empty and no start/end text
  markers are required. Also check whether the current search algorithm
  is acceptable.

* Add utf8 util test

* For empty output, produce shapes [C] -> [C][0] and [N][C] -> [N][C][0]

* Augment TST search with match conflict resolution in favor of
  earlier specified pattern matches.

* Address macOS build error.

* Adjust error message

* Address review comments.

* Remove nested loops.

* Remove 3rd party utf8 validation code.

* Address review comments part I.

* Move padding outside start/end markers.
  Split unit tests into individual test cases.

* Fix a common prefix bug reported by Xavier.
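
The behaviors listed above (separator matching that favors earlier-specified
patterns, padding placed outside the start/end markers, and a zero-width
result for empty output) can be illustrated with a rough Python sketch. This
is not the code from this commit: it uses a plain linear scan over the
separators rather than a TST, and the names split_on_separators,
tokenize_batch, START_MARK, END_MARK, and the "#" pad value are hypothetical,
chosen only for illustration. The "earlier pattern wins" rule shown here is
one plausible reading of the conflict resolution described in the message.

```python
from typing import List

# Assumed marker characters for illustration only.
START_MARK = "\x02"
END_MARK = "\x03"


def split_on_separators(text: str, separators: List[str]) -> List[str]:
    """Split text on any separator. When several separators match at the
    same position, the one listed earliest wins, even if a later one is
    longer. Empty tokens between adjacent separators are dropped."""
    tokens: List[str] = []
    start = 0  # index where the token being accumulated begins
    i = 0
    while i < len(text):
        # First separator (in declared order) that matches at position i.
        matched = next(
            (s for s in separators if s and text.startswith(s, i)), None
        )
        if matched is None:
            i += 1
            continue
        if i > start:
            tokens.append(text[start:i])
        i += len(matched)
        start = i
    if start < len(text):
        tokens.append(text[start:])
    return tokens


def tokenize_batch(
    rows: List[str],
    separators: List[str],
    mark: bool = False,
    pad_value: str = "#",
) -> List[List[str]]:
    """Tokenize each row and pad all rows to the same width.
    Markers wrap the real tokens; padding goes after the end marker."""
    tokenized = [split_on_separators(r, separators) for r in rows]
    width = max((len(t) for t in tokenized), default=0)
    result = []
    for toks in tokenized:
        padding = [pad_value] * (width - len(toks))
        if mark:
            result.append([START_MARK] + toks + [END_MARK] + padding)
        else:
            result.append(toks + padding)
    return result
```

Under these assumptions, tokenize_batch(["a b", "c"], [" "], mark=True)
pads the second row after its end marker, and tokenize_batch([";", ";;"],
[";"]) yields two empty rows, mirroring the [C] -> [C][0] case.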