Commits


Dmitri Smirnov authored and GitHub committed c52636e187e
Implement Tokenizer op (#31)

* Implement separator tokenizer with TST. TODO: Clarify what to do if the output is empty and no start/end text markers are required. Also see if the current search algo is acceptable.
* Add utf8 util test.
* For empty output produce [C] -> [C][0], [N][C] -> [N][C][0].
* Augment TST search with match conflict resolution in favor of earlier specified pattern matches.
* Address macOS build error.
* Adjust error message.
* Address review comments.
* Remove nested loops.
* Remove 3rd party utf8 validation code.
* Address review comments part I.
* Move padding outside start/end markers. Split unit tests for individual test cases.
* Fix a common prefix bug reported by Xavier.
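The conflict-resolution rule mentioned above (when several separator patterns match at the same position, the earlier specified pattern wins) can be illustrated with a minimal Python sketch. This is not the code from this commit: it swaps the TST for a plain linear scan over the separator list, and the function name `tokenize` and its signature are hypothetical, chosen only for illustration.

```python
def tokenize(text, separators):
    """Split text on separators (hypothetical helper, not the op's real API).

    Assumption: when more than one separator matches at the same position,
    the one listed earliest in `separators` wins. A linear scan stands in
    for the TST lookup used by the actual implementation.
    """
    tokens = []
    start = 0  # index where the current token begins
    i = 0
    while i < len(text):
        # First separator, in list order, that matches at position i.
        match = next(
            (s for s in separators if s and text.startswith(s, i)), None
        )
        if match is None:
            i += 1
            continue
        if i > start:
            tokens.append(text[start:i])  # emit the token before the separator
        i += len(match)  # skip over the separator itself
        start = i
    if start < len(text):
        tokens.append(text[start:])  # trailing token, if any
    return tokens
```

With the text `"xaby"`, listing `"a"` before `"ab"` consumes only the `a`, leaving `by` as a token, while listing `"ab"` first consumes both characters. Input that yields no tokens returns an empty list, which loosely corresponds to the zero-sized trailing dimension described in the commit message.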