Speaker recognition in diarization
Hi,
While transcription and alignment seem to work quite well, speaker recognition during diarization is rather deficient in some cases.
This may be an intrinsic limitation of pyannote, since the results are almost identical between this app and whisperX.
Some examples below.
Do you believe that segmenting the audio first, diarizing each segment independently, and then merging the outputs could yield better results? A rough sketch of what I mean is below.
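To make that concrete, here is a minimal sketch using pyannote's pretrained speaker-diarization-3.1 pipeline and the pyannote/embedding model: diarize fixed-length chunks independently, then stitch the local speaker labels together by clustering per-speaker embeddings. The file name, the HF_TOKEN placeholder, the 60-second chunk length, and the 0.7 cosine threshold are all arbitrary choices for illustration, not tuned values:

```python
import numpy as np
import torchaudio
from pyannote.audio import Inference, Pipeline
from scipy.spatial.distance import cosine

CHUNK_SECONDS = 60      # arbitrary chunk length
MATCH_THRESHOLD = 0.7   # arbitrary cosine-similarity threshold

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
embedder = Inference("pyannote/embedding", window="whole")

waveform, sr = torchaudio.load("episode.wav")  # (channels, samples)

global_speakers = []  # list of (centroid embedding, global label)
merged = []           # (start, end, global label) over the whole file

for offset in range(0, waveform.shape[1], CHUNK_SECONDS * sr):
    chunk = waveform[:, offset:offset + CHUNK_SECONDS * sr]
    diarization = pipeline({"waveform": chunk, "sample_rate": sr})

    # Average one embedding per local speaker over all their turns.
    local_embs = {}
    for turn, _, label in diarization.itertracks(yield_label=True):
        segment = chunk[:, int(turn.start * sr):int(turn.end * sr)]
        if segment.shape[1] < sr // 2:  # skip turns shorter than 0.5 s
            continue
        emb = np.asarray(embedder({"waveform": segment, "sample_rate": sr}))
        local_embs.setdefault(label, []).append(emb)

    # Map each local speaker to the closest global one, or create a new one.
    mapping = {}
    for label, embs in local_embs.items():
        centroid = np.mean(embs, axis=0)
        best_idx, best_sim = None, MATCH_THRESHOLD
        for idx, (g_centroid, _) in enumerate(global_speakers):
            sim = 1.0 - cosine(centroid, g_centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            global_speakers.append(
                (centroid, f"SPEAKER_{len(global_speakers):02d}"))
            best_idx = len(global_speakers) - 1
        mapping[label] = global_speakers[best_idx][1]

    # Shift local timestamps back to positions in the full file.
    t0 = offset / sr
    for turn, _, label in diarization.itertracks(yield_label=True):
        if label in mapping:
            merged.append((t0 + turn.start, t0 + turn.end, mapping[label]))

for start, end, speaker in merged:
    print(f"{start:8.2f} {end:8.2f} {speaker}")
```

I have no idea whether turns cut at chunk boundaries would hurt more than the per-chunk clustering helps, which is partly why I'm asking.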
Best,
Ed
EXAMPLES
Here it works well:
https://www.youtube.com/watch?v=Fyb2AiF1feI
But here it is a disaster:
https://www.youtube.com/watch?v=DxxAwDHgQhE
https://www.youtube.com/watch?v=qHrN5Mf5sgo
In the first one there are only two speakers, a man and a woman, so it seems like an easy scenario.
The second one has three speakers with very different pitches and accents, yet most sentences are attributed to a single speaker (0).
The third one has three male speakers with similar pitches and accents; speaker attribution is pretty much random.