Semi-supervised learning (SSL) has become a topic of significant practical importance in today's data analysis applications, where the amount of unlabeled data is growing exponentially while user input remains limited by logistics and expense. Semi-supervised clustering, a subclass of SSL, uses user input in the form of relationships between data points (e.g., pairs of data points belonging to the same class or to different classes) to substantially improve the performance of unsupervised clustering while reflecting user-defined knowledge of the underlying data distribution. Existing algorithms incorporate such relationships as either hard constraints or soft penalties, which is at odds with the generative nature of the underlying model and results in formulations that are suboptimal and not sufficiently general. We propose a fully generative, probabilistic model, rather than a heuristic one, that reflects the joint distribution given by the user-defined pairwise relationships.
The proposed semi-supervised clustering algorithm uses a Gaussian mixture model for the data and fits the model using expectation-maximization.
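
For orientation, the fitting procedure mentioned above can be illustrated with a minimal sketch of standard expectation-maximization for a plain, unconstrained Gaussian mixture model. This is not the proposed model: it omits the user-defined pairwise relationships entirely and shows only the baseline E-step and M-step that such a model would build on. The function name `fit_gmm_em` and its parameters (`k`, `n_iter`, `seed`, `reg`) are hypothetical choices made for this illustration.

```python
import numpy as np


def fit_gmm_em(X, k, n_iter=100, seed=0, reg=1e-6):
    """Fit a k-component full-covariance GMM to X (n x d) with plain EM.

    Illustrative baseline only: no pairwise relationships are used here.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Initialization (an assumption of this sketch): random data points as
    # means, the overall data covariance for every component, equal weights.
    means = X[rng.choice(n, size=k, replace=False)]
    base_cov = np.cov(X.T).reshape(d, d) + reg * np.eye(d)
    covs = np.stack([base_cov] * k)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log of (mixing weight * Gaussian density) per point/component.
        log_r = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(covs[j]), diff)
            log_r[:, j] = np.log(weights[j]) - 0.5 * (
                d * np.log(2.0 * np.pi) + logdet + maha
            )
        # Normalize in a numerically stable way to get responsibilities.
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and covariances.
        nk = resp.sum(axis=0) + 1e-12  # guard against empty components
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)

    return weights, means, covs, resp
```

Calling `fit_gmm_em(X, k=2)` on an `n x d` array returns the mixing weights, component means, covariances, and the per-point responsibilities from the final E-step; a hard clustering can be read off with `resp.argmax(axis=1)`. In the setting described above, the pairwise relationships would enter the generative model itself rather than being bolted onto this loop as constraints or penalties.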