ailia_tokenizer  1.3.0.0

Features of ailia Tokenizer

This page describes the features provided by both the C and the C# APIs.

Compatibility of Tokenizer

ailiaTokenizerEncode matches the following call from transformers. Special tokens are encoded as plain text. No padding or truncation is performed.

input_ids = tokenizer(sents, split_special_tokens=True)

ailiaTokenizerEncodeWithSpecialTokens matches the following call from transformers. Special tokens are encoded as special tokens. No padding or truncation is performed.

input_ids = tokenizer(sents)

ailiaTokenizerDecode matches the following call from transformers. Special tokens are not included in the output.

tokenizer.decode(input_ids, skip_special_tokens=True)

ailiaTokenizerDecodeWithSpecialTokens matches the following call from transformers. Special tokens are included in the output.

tokenizer.decode(input_ids)
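
For example, with a transformers BERT tokenizer the four variants correspond as follows (a minimal sketch; the model name and sample text are only examples):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
text = "hello [SEP] world"

# ailiaTokenizerEncode equivalent: "[SEP]" is split and encoded as plain text
ids_as_text = tokenizer(text, split_special_tokens=True)["input_ids"]

# ailiaTokenizerEncodeWithSpecialTokens equivalent: "[SEP]" keeps its special token id
ids_as_special = tokenizer(text)["input_ids"]

# ailiaTokenizerDecode equivalent: special tokens are dropped from the decoded text
print(tokenizer.decode(ids_as_special, skip_special_tokens=True))

# ailiaTokenizerDecodeWithSpecialTokens equivalent: special tokens are kept
print(tokenizer.decode(ids_as_special))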

Tokenizer Types

AILIA_TOKENIZER_TYPE_WHISPER

Corresponds to the following Python process.

from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base", predict_timestamps=True)
inputs = tokenizer(sents)

If you want to match the following OpenAI implementation, remove the leading SOT and the trailing EOT (see the sketch after the code below).

from whisper.tokenizer import get_tokenizer
is_multilingual = True
tokenizer = get_tokenizer(is_multilingual)
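
For example, given an encoded sequence of the form [SOT, ..., EOT] (a minimal sketch; the ids shown are placeholders, not real token ids):

sot, eot = 1, 2              # placeholder special-token ids
token_ids = [sot, 10, 11, 12, eot]
# get_tokenizer(is_multilingual).encode(text) adds no special tokens,
# so drop the leading SOT and the trailing EOT before comparing.
trimmed_ids = token_ids[1:-1]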

AILIA_TOKENIZER_TYPE_CLIP

Corresponds to the following Python process. SOT and EOT are added. Padding is not performed, so if necessary, pad the output with trailing zeros up to 77 tokens (see the padding sketch after the code below).

from transformers import CLIPTokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
inputs = tokenizer(sents)

It also corresponds to the following OpenAI CLIP implementation.

from clip.simple_tokenizer import SimpleTokenizer as _Tokenizer
_tokenizer = _Tokenizer()
sot_token = _tokenizer.encoder["<|startoftext|>"]
eot_token = _tokenizer.encoder["<|endoftext|>"]
all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
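
A minimal padding sketch (assuming the encoded sequence, including SOT and EOT, is at most 77 tokens long):

CONTEXT_LENGTH = 77  # CLIP's fixed text context length

def pad_to_context_length(token_ids, context_length=CONTEXT_LENGTH):
    # Pad with trailing zeros so the sequence is exactly context_length tokens long.
    if len(token_ids) > context_length:
        raise ValueError("text is too long for CLIP's context length")
    return list(token_ids) + [0] * (context_length - len(token_ids))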

AILIA_TOKENIZER_TYPE_XLM_ROBERTA

Corresponds to the following process in Python. Separately, sentencepiece.bpe.model must be given to ailiaTokenizerOpenModelFile; a sketch for obtaining the file follows the code below.

from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
inputs = tokenizer(sents)
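
If you do not already have the model file, one way to obtain it is to download it from the Hugging Face Hub (a sketch; it assumes the repository stores the SentencePiece model under the standard name sentencepiece.bpe.model):

from huggingface_hub import hf_hub_download

# Download the SentencePiece model and pass the returned path to ailiaTokenizerOpenModelFile.
model_path = hf_hub_download(
    repo_id="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    filename="sentencepiece.bpe.model",
)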

AILIA_TOKENIZER_TYPE_MARIAN

Corresponds to the following process in Python. Separately, source.spm must be given to ailiaTokenizerOpenModelFile.

from transformers import MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("staka/fugumt-en-ja")
inputs = tokenizer(sents)

AILIA_TOKENIZER_TYPE_BERT_JAPANESE_WORDPIECE

Corresponds to the following Python process. NFKC normalization is performed by ailia Tokenizer. Separately, ipadic must be given to ailiaTokenizerOpenDictionaryFile and tokenizer_wordpiece/vocab.txt to ailiaTokenizerOpenVocabFile.

from transformers import BertJapaneseTokenizer
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
inputs = tokenizer(sents)

If you want to match convert_tokens_to_ids, remove the [CLS] and [SEP] symbols from the beginning and the end of the output (see the sketch after the code below).

tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
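
For example (a minimal sketch, assuming text is a single string):

# tokenizer(text)["input_ids"] has the form [CLS, ..., SEP];
# convert_tokens_to_ids returns the same ids without [CLS] and [SEP].
with_special_tokens = tokenizer(text)["input_ids"]
assert with_special_tokens[1:-1] == indexed_tokens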

AILIA_TOKENIZER_TYPE_BERT_JAPANESE_CHARACTER

Corresponds to the following Python process. NFKC normalization is performed by ailia Tokenizer. Separately, ipadic must be given to ailiaTokenizerOpenDictionaryFile and tokenizer_character/vocab.txt to ailiaTokenizerOpenVocabFile.

from transformers import BertJapaneseTokenizer
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')
inputs = tokenizer(sents)

If you want to match convert_tokens_to_ids, remove the [CLS] and [SEP] symbols from the beginning and the end of the output.

tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

AILIA_TOKENIZER_TYPE_T5

Corresponds to the following Python process. An EOS symbol is inserted at the end of the output, as with add_special_tokens=True; if you do not need it, remove the trailing EOS (see the sketch after the code below). Separately, spiece.model must be given to ailiaTokenizerOpenModelFile.

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('sonoisa/t5-base-japanese-title-generation')
inputs = tokenizer(sents)
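
For example, to drop the trailing EOS (a minimal sketch, assuming sents is a single string):

input_ids = tokenizer(sents)["input_ids"]
# The last id is the EOS symbol appended as with add_special_tokens=True.
ids_without_eos = input_ids[:-1]
assert ids_without_eos == tokenizer(sents, add_special_tokens=False)["input_ids"]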

AILIA_TOKENIZER_TYPE_ROBERTA

Corresponds to the following Python process, which inserts an EOS symbol at the end of the output. Separately, you need to provide vocab.json to ailiaTokenizerOpenVocabFile and merges.txt to ailiaTokenizerOpenMergeFile.

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
inputs = tokenizer(sents)

AILIA_TOKENIZER_TYPE_BERT

Corresponds to the following Python process. Separately, vocab.txt and tokenizer_config.json must be provided.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')
inputs = tokenizer(sents)

The cased model is also supported.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-cased')
inputs = tokenizer(sents)

If you want to match convert_tokens_to_ids, remove the [CLS] and [SEP] symbols from the beginning and the end of the output.

tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

AILIA_TOKENIZER_TYPE_GPT2

Corresponds to the following Python process. Separately, you need to provide vocab.json and merges.txt; a sketch for obtaining the two files follows the code below.

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer(sents)
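
If you do not have the two files, one way to obtain them is to export them from the transformers tokenizer (a sketch; save_pretrained writes vocab.json and merges.txt, along with other config files, into the given directory):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Creates ./gpt2_tokenizer/ and writes vocab.json and merges.txt into it;
# pass those file paths to the ailia Tokenizer file-open functions.
tokenizer.save_pretrained("./gpt2_tokenizer")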

AILIA_TOKENIZER_TYPE_LLAMA

Corresponds to the following Python process. Separately, tokenizer.model must be given to ailiaTokenizerOpenModelFile.

from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("liuhaotian/llava-v1.5-7b")
inputs = tokenizer(sents)