Speaker recognition in diarization
Hi,
While transcription and alignment seem to work quite well, speaker recognition during diarization is rather deficient in some cases.
This may be an intrinsic limitation of pyannote, since the results are almost identical between this app and whisperX.
Some examples below.
Do you believe that segmenting the audio first, diarizing each segment independently, and then merging the outputs could yield better results? A rough sketch of what I mean is below.
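To make that concrete, here is a minimal sketch using pyannote's pretrained speaker-diarization-3.1 pipeline and the pyannote/embedding model: diarize fixed-length chunks independently, then stitch the local speaker labels together by clustering per-speaker embeddings. The file name, the HF_TOKEN placeholder, the 60-second chunk length, and the 0.7 cosine threshold are all arbitrary choices for illustration, not tuned values:

```python
import numpy as np
import torchaudio
from pyannote.audio import Inference, Pipeline
from scipy.spatial.distance import cosine

CHUNK_SECONDS = 60      # arbitrary chunk length
MATCH_THRESHOLD = 0.7   # arbitrary cosine-similarity threshold

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
embedder = Inference("pyannote/embedding", window="whole")

waveform, sr = torchaudio.load("episode.wav")  # (channels, samples)

global_speakers = []  # list of (centroid embedding, global label)
merged = []           # (start, end, global label) over the whole file

for offset in range(0, waveform.shape[1], CHUNK_SECONDS * sr):
    chunk = waveform[:, offset:offset + CHUNK_SECONDS * sr]
    diarization = pipeline({"waveform": chunk, "sample_rate": sr})

    # Average one embedding per local speaker over all their turns.
    local_embs = {}
    for turn, _, label in diarization.itertracks(yield_label=True):
        segment = chunk[:, int(turn.start * sr):int(turn.end * sr)]
        if segment.shape[1] < sr // 2:  # skip turns shorter than 0.5 s
            continue
        emb = np.asarray(embedder({"waveform": segment, "sample_rate": sr}))
        local_embs.setdefault(label, []).append(emb)

    # Map each local speaker to the closest global one, or create a new one.
    mapping = {}
    for label, embs in local_embs.items():
        centroid = np.mean(embs, axis=0)
        best_idx, best_sim = None, MATCH_THRESHOLD
        for idx, (g_centroid, _) in enumerate(global_speakers):
            sim = 1.0 - cosine(centroid, g_centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            global_speakers.append(
                (centroid, f"SPEAKER_{len(global_speakers):02d}"))
            best_idx = len(global_speakers) - 1
        mapping[label] = global_speakers[best_idx][1]

    # Shift local timestamps back to positions in the full file.
    t0 = offset / sr
    for turn, _, label in diarization.itertracks(yield_label=True):
        if label in mapping:
            merged.append((t0 + turn.start, t0 + turn.end, mapping[label]))

for start, end, speaker in merged:
    print(f"{start:8.2f} {end:8.2f} {speaker}")
```

I have no idea whether turns cut at chunk boundaries would hurt more than the per-chunk clustering helps, which is partly why I'm asking.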
Best,
Ed
EXAMPLES
Here it works well:
https://www.youtube.com/watch?v=Fyb2AiF1feI
But here it is a disaster:
https://www.youtube.com/watch?v=DxxAwDHgQhE
https://www.youtube.com/watch?v=qHrN5Mf5sgo
In the first one there are only two speakers, a man and a woman, so it seems like an easy scenario.
The second one has three speakers with very different pitches and accents, yet most sentences are attributed to a single speaker (0).
The third one has three male speakers with similar pitches and accents; speaker attribution is pretty much random.