The aim of this work is to provide robust, low-complexity demixing of sound sources from a set of microphone signals for a typical meeting scenario in which the source mixture is relatively sparse in time. We define a similarity matrix that characterizes the similarity of the spatial signatures of the observations at different time instants within a frequency band. Each entry of the similarity matrix is the sum of a set of kernelized similarity measures, each operating on a single frequency bin. The kernelization leads to high robustness because it reduces the influence of outliers. Clustering by means of affinity propagation separates the talkers without the need to specify the number of talkers in advance. The clusters can be used directly for separation, or they can serve as a global pre-processing step that identifies sources for an adaptive demixing procedure. Our experimental results confirm that the approach performs significantly better than two reference methods.
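
To make the construction of the similarity matrix concrete, the following is a minimal sketch under stated assumptions, not the exact formulation of this work. It assumes that the spatial signature in a frequency bin is the complex vector of microphone observations, that the per-bin similarity is the squared normalized inner product of two such vectors, and that the kernel is an exponential of the resulting distance with a hypothetical width `sigma`. The names `bin_similarity`, `similarity_matrix`, and `frames` are illustrative only.

```python
import math

def bin_similarity(x, y):
    """Squared normalized inner product of two complex microphone vectors.

    Assumed per-bin measure (not taken from the source); the value lies in
    [0, 1] and is invariant to the scaling of either vector.
    """
    inner = sum(a * b.conjugate() for a, b in zip(x, y))
    energy = sum(abs(a) ** 2 for a in x) * sum(abs(b) ** 2 for b in y)
    return abs(inner) ** 2 / energy if energy > 0 else 0.0

def similarity_matrix(frames, sigma=0.1):
    """Build a time-by-time similarity matrix for one frequency band.

    frames[t][k] is the microphone vector at time instant t and frequency
    bin k of the band. Each entry sums a kernelized per-bin measure, so a
    single badly mismatched bin contributes close to zero instead of
    dominating the sum. The exponential kernel and sigma are assumptions.
    """
    n = len(frames)
    S = [[0.0] * n for _ in range(n)]
    for t in range(n):
        for u in range(t, n):
            total = 0.0
            for x, y in zip(frames[t], frames[u]):
                distance = 1.0 - bin_similarity(x, y)
                total += math.exp(-distance / sigma)
            S[t][u] = S[u][t] = total  # the matrix is symmetric
    return S

# Toy example: two microphones, two bins, three time instants.
# Instants 0 and 1 share a spatial signature (up to scaling); instant 2 differs.
frames = [
    [[1 + 0j, 1j], [1 + 0j, -1 + 0j]],
    [[2 + 0j, 2j], [0.5 + 0j, -0.5 + 0j]],
    [[1 + 0j, -1j], [1 + 0j, 1 + 0j]],
]
S = similarity_matrix(frames)
```

In this toy example the entry for instants 0 and 1 reaches the maximum value (the number of bins), whereas the entry for instants 0 and 2 is close to zero. A matrix of this kind could then be passed to an affinity propagation implementation that accepts precomputed similarities, for example `sklearn.cluster.AffinityPropagation(affinity="precomputed")`, which does not require the number of clusters to be specified.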