Hey! My advice would be to follow this blog post for fine-tuning the model.

The Whisper model is pre-trained on 96 languages. This means that the pre-trained tokenizer already has a vast vocabulary encompassing many thousands of words! I would recommend that you leverage this pre-trained tokenizer directly rather than training a new one. Why? Because then we can also leverage all of the pre-trained Whisper weights directly! If we build a new tokenizer, we have to randomly initialise some of the Whisper weights to work with the new tokenizer, meaning we lose some of the knowledge from pre-training. If we use the pre-trained one, we can use all of the weights (and so all of the knowledge!), and the Whisper model quickly learns which part of the pre-trained tokenizer to use when fine-tuning. So I'd recommend you keep the pre-trained tokenizer and simply set the correct language when you instantiate the processor (see the sketch below). Yes, there's a bit of redundancy in the tokenizer, but overall performance should be better! What language are you fine-tuning on? It's quite likely that all the characters you need are already in the pre-trained Whisper tokenizer!

I would strongly advise against fine-tuning only the language model (decoder) of the Whisper model on text-only data. My worry here is that we would completely break the model and lose all of its pre-trained capabilities. Whisper is an encoder-decoder architecture: the encoder transforms the audio inputs into a set of hidden-state representations, extracting important features from the spoken speech, and the decoder auto-regressively predicts text tokens, conditional on both the previously predicted tokens and the encoder hidden states. If we omit the encoder hidden states, we completely change the functionality of the Whisper model: the decoder now predicts tokens conditional only on the previously predicted tokens, not the encoder hidden states. Training on text only will change the weights such that the model uses just the previous tokens and ignores the encoder hidden representations. Thus, the model goes from being purposed for speech recognition (speech to text) to causal language modelling (text to text). When we then use this fine-tuned model at inference time with audio inputs, the weights will be messed up for speech recognition and the model will likely fail. What I would suggest instead is either:

1. Fine-tuning the model on audio-transcription pairs (i.e. get the audio for your text sentences and train on audio + text) according to the blog post.
2. Using the zero-shot model (no fine-tuning) to generate Whisper predictions: take the prediction from the Whisper model, find the sentence in your corpus of 1000 sentences that is most similar to this prediction, and use this nearest sentence as your output (see the sketch below).

Hey! In my experience, the most important three are:

One thing I've noticed a lot looking at training logs is noisy training loss curves. A noisy loss generally means noisy parameter updates, which can throw your model off and delay it reaching a local optimum. A noisy training loss can be combated by increasing your batch size: a larger batch size means more training samples per update, and is thus closer to the 'true' gradient update that you'd get using all the data at once. You can find recommended batch size configurations here.
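To make the batch-size point concrete, here is a minimal sketch of how the effective batch size is typically set in the training arguments. The values below are illustrative placeholders, not the recommended configuration linked above; if you run out of memory, a common trick is to halve `per_device_train_batch_size` and double `gradient_accumulation_steps`, which keeps the effective batch size the same.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative values only. Effective batch size per device =
#   per_device_train_batch_size * gradient_accumulation_steps.
# A larger effective batch size averages the gradient over more samples,
# which smooths a noisy training loss curve.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",   # placeholder output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,      # effective batch size of 32 per device
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
)
```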
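Returning to the tokenizer advice above, the processor line looks roughly like this. It is a minimal sketch assuming the `openai/whisper-small` checkpoint and Hindi as the target language; substitute whichever checkpoint and language you are actually fine-tuning on.

```python
from transformers import WhisperProcessor

# Keep the pre-trained tokenizer and feature extractor, and just set the
# target language and task, instead of training a new tokenizer from scratch.
# The checkpoint and language below are placeholders - swap in your own.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)
```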
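And to illustrate the second option above (zero-shot predictions matched against a fixed corpus), here is a rough sketch. The checkpoint, the example `sentences`, and the use of `difflib.SequenceMatcher` as the similarity measure are illustrative assumptions on my part, not something prescribed in the replies above.

```python
import difflib
from transformers import pipeline

# Zero-shot Whisper (no fine-tuning); the checkpoint is an illustrative choice.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Fixed corpus of candidate sentences (placeholders for your 1000 sentences).
sentences = ["turn on the lights", "turn off the lights", "play some music"]

def constrained_transcribe(audio_path: str) -> str:
    # 1. Get the free-form Whisper prediction for the audio file.
    prediction = asr(audio_path)["text"]
    # 2. Return the corpus sentence most similar to that prediction.
    return max(
        sentences,
        key=lambda s: difflib.SequenceMatcher(None, prediction.lower(), s).ratio(),
    )

# Example usage:
# print(constrained_transcribe("clip.wav"))
```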