Because Microsoft Sam was built to run on low-spec hardware, its synthesis had quite a few funny quirks. Today I’d like to try some weird and funny phrases on Kokoro, an AI TTS model, using its ONNX export. This is purely experimental: I simply wanted to see how I could break an AI TTS model.
So here’s my code and some absolutely weird results:
Model used:
hexgrad. (2025). GitHub – hexgrad/misaki: G2P. GitHub. https://github.com/hexgrad/misaki
onnx-community/Kokoro-82M-v1.0-ONNX · Hugging Face. (2025). Huggingface.co. https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX
Results
ROFLcopter
This is a classic one!
My ROFLCopter goes soi soi soi soi soi soi soi soi soi soi soi soi soi
From Kokoro:
My Steamroller
My steamroller goes lol lol lol lol lol lol lol lol lol lol lol lol lol lol
However, this time there was no audio at all for the lol part. I realised that I needed to provide a fallback phoneme source for Misaki, which in this case is espeak-ng.
Kokoro with espeak-ng:
Now that’s more like it! Now let’s try the ROFLcopter with espeak-ng fallback:
Now we are talking like SAM!
Why
Looking at the code, there are two components that can get confused: Misaki and Kokoro. Misaki is the G2P (grapheme-to-phoneme) step that turns text into phonemes. It can be tripped up by words it cannot convert, which the espeak-ng fallback fixes. Kokoro, the model itself, can be confused by uncommon token sequences.
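One way to see where symbols go missing is to check which characters of a phoneme string have no ID in the vocabulary, since the lookup in this experiment silently skips them. The following is only a minimal sketch: `VOCAB_SUBSET` is a small excerpt of the table listed further down, `split_known_unknown` is a hypothetical helper, and both the phoneme string and the `#` marker are made up for illustration rather than real Misaki output.

```python
# Small excerpt of the phoneme-to-ID table used in this post (not the full table).
VOCAB_SUBSET = {" ": 16, "l": 54, "ɑ": 69, "ˈ": 156}


def split_known_unknown(phonemes, vocab):
    """Return (token_ids, dropped_chars) for a phoneme string.

    Hypothetical helper: characters missing from the vocabulary are
    collected instead of being silently skipped.
    """
    ids, dropped = [], []
    for ch in phonemes:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            dropped.append(ch)
    return ids, dropped


# "#" is a made-up stand-in for a symbol the vocabulary does not know.
ids, dropped = split_known_unknown("lˈɑl #", VOCAB_SUBSET)
print(ids)      # [54, 156, 69, 54, 16]
print(dropped)  # ['#']
```

If `dropped` is non-empty, the model never sees those symbols, which is one plausible reason for a stretch of silence in the output.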
Code used in this experiment
(converted from a Jupyter Notebook)
import IPython
import os
import numpy as np
from onnxruntime import InferenceSession
from misaki import en, espeak  # espeak provides the EspeakFallback used below
import scipy.io.wavfile as wavfile
def phonemes_to_tokens(phonemes):
    # Map each phoneme character to its token ID
    vocab = {
        ";": 1,
        ":": 2,
        ",": 3,
        ".": 4,
        "!": 5,
        "?": 6,
        "—": 9,
        "…": 10,
        "\"": 11,
        "(": 12,
        ")": 13,
        "“": 14,
        "”": 15,
        " ": 16,
        "\u0303": 17,
        "ʣ": 18,
        "ʥ": 19,
        "ʦ": 20,
        "ʨ": 21,
        "ᵝ": 22,
        "\uAB67": 23,
        "A": 24,
        "I": 25,
        "O": 31,
        "Q": 33,
        "S": 35,
        "T": 36,
        "W": 39,
        "Y": 41,
        "ᵊ": 42,
        "a": 43,
        "b": 44,
        "c": 45,
        "d": 46,
        "e": 47,
        "f": 48,
        "h": 50,
        "i": 51,
        "j": 52,
        "k": 53,
        "l": 54,
        "m": 55,
        "n": 56,
        "o": 57,
        "p": 58,
        "q": 59,
        "r": 60,
        "s": 61,
        "t": 62,
        "u": 63,
        "v": 64,
        "w": 65,
        "x": 66,
        "y": 67,
        "z": 68,
        "ɑ": 69,
        "ɐ": 70,
        "ɒ": 71,
        "æ": 72,
        "β": 75,
        "ɔ": 76,
        "ɕ": 77,
        "ç": 78,
        "ɖ": 80,
        "ð": 81,
        "ʤ": 82,
        "ə": 83,
        "ɚ": 85,
        "ɛ": 86,
        "ɜ": 87,
        "ɟ": 90,
        "ɡ": 92,
        "ɥ": 99,
        "ɨ": 101,
        "ɪ": 102,
        "ʝ": 103,
        "ɯ": 110,
        "ɰ": 111,
        "ŋ": 112,
        "ɳ": 113,
        "ɲ": 114,
        "ɴ": 115,
        "ø": 116,
        "ɸ": 118,
        "θ": 119,
        "œ": 120,
        "ɹ": 123,
        "ɾ": 125,
        "ɻ": 126,
        "ʁ": 128,
        "ɽ": 129,
        "ʂ": 130,
        "ʃ": 131,
        "ʈ": 132,
        "ʧ": 133,
        "ʊ": 135,
        "ʋ": 136,
        "ʌ": 138,
        "ɣ": 139,
        "ɤ": 140,
        "χ": 142,
        "ʎ": 143,
        "ʒ": 147,
        "ʔ": 148,
        "ˈ": 156,
        "ˌ": 157,
        "ː": 158,
        "ʰ": 162,
        "ʲ": 164,
        "↓": 169,
        "→": 171,
        "↗": 172,
        "↘": 173,
        "ᵻ": 177
    }
    # Characters that are not in the vocabulary are silently skipped
    tokens = [vocab[char] for char in phonemes if char in vocab]
    return tokens
fallback = espeak.EspeakFallback(british=False) # en-us
g2p = en.G2P(trf=False, british=False, fallback=fallback)

def tttk(s):
    # Text -> phonemes (Misaki) -> token IDs
    phonemes, tokens = g2p(s)
    return phonemes_to_tokens(phonemes)
tokens = tttk("My ROFLCopter goes soi soi soi soi soi soi soi soi soi soi soi soi soi")

# Context length is 512, but leave room for the pad token 0 at the start & end
assert len(tokens) <= 510, len(tokens)

# Style vector based on len(tokens), ref_s has shape (1, 256)
voices = np.fromfile('./voices/af.bin', dtype=np.float32).reshape(-1, 1, 256)
ref_s = voices[len(tokens)]

# Add the pad ids
tokens = [[0, *tokens, 0]]

model_name = 'model.onnx' # Options: model.onnx, model_fp16.onnx, model_quantized.onnx, model_q8f16.onnx, model_uint8.onnx, model_uint8f16.onnx, model_q4.onnx, model_q4f16.onnx
sess = InferenceSession(os.path.join('onnx', model_name))

audio = sess.run(None, dict(
    input_ids=tokens,
    style=ref_s,
    speed=np.ones(1, dtype=np.float32),
))[0]

wavfile.write('audio.wav', 24000, audio[0])
IPython.display.Audio('audio.wav')
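The length check and padding from the last cell can also be wrapped in a small function, which is handy when trying many odd phrases in a row. This is only a sketch of the same two steps: `pad_for_model`, `MAX_CONTEXT` and `PAD_ID` are hypothetical names, and the 512 and 0 values are taken from the comments in the code above.

```python
MAX_CONTEXT = 512  # context length mentioned in the code comment above
PAD_ID = 0         # pad token placed at the start and end


def pad_for_model(token_ids, max_context=MAX_CONTEXT, pad_id=PAD_ID):
    """Check the length, then wrap the IDs with pad tokens as a batch of one.

    Hypothetical helper mirroring the assert and padding lines of the cell above.
    """
    budget = max_context - 2  # leave room for the two pad tokens
    if len(token_ids) > budget:
        raise ValueError(f"{len(token_ids)} tokens, but only {budget} fit")
    return [[pad_id, *token_ids, pad_id]]


batch = pad_for_model([54, 156, 69, 54])
print(batch)  # [[0, 54, 156, 69, 54, 0]]
```

A phrase that is too long then fails with a readable error instead of a bare assertion, and shorter phrases can be passed to the model as `input_ids` in the same shape as before.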