MITPress - SemiSupervised Learning
Olivier Chapelle,
Bernhard Schölkopf,
and Alexander Zien
Semi-Supervised Learning
Adaptive Computation and Machine Learning
Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J.
Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour,
and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and
Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and
Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christo-
pher K. I. Williams
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander
Zien
All rights reserved. No part of this book may be reproduced in any form by any electronic
or mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.
Contents
Series Foreword
Preface
I Generative Models
VI Perspectives
References
Contributors
Index
Series Foreword
The goal of building systems that can adapt to their environments and learn from
their experience has attracted researchers from many fields, including computer
science, engineering, mathematics, physics, neuroscience, and cognitive science.
Out of this research has come a wide variety of learning techniques that have
the potential to transform many scientific and industrial fields. Recently, several
research communities have converged on a common set of issues surrounding su-
pervised, unsupervised, and reinforcement learning problems. The MIT Press series
on Adaptive Computation and Machine Learning seeks to unify the many diverse
strands of machine learning research and to foster high-quality research and inno-
vative applications.
Thomas Dietterich
Preface
In recent years, semi-supervised learning has emerged as an exciting new
direction in machine learning research. It is closely related to profound issues of how
to do inference from data, as witnessed by its overlap with transductive inference
(the distinctions are yet to be made precise).
At the same time, because it deals with the situation where relatively few labeled training
points are available but a large number of unlabeled points are given, it is directly
relevant to a multitude of practical problems where it is relatively expensive to
produce labeled data, e.g., the automatic classification of web pages. As a field,
semi-supervised learning uses a diverse set of tools and illustrates, on a small scale,
the sophisticated machinery developed in various branches of machine learning such
as kernel methods or Bayesian techniques.
As we work on semi-supervised learning, we have been aware of the lack of
an authoritative overview of the existing approaches. In a perfect world, such an
overview should help both the practitioner and the researcher who wants to enter
this area. A well researched monograph could ideally fill such a gap; however, the
field of semi-supervised learning is arguably not yet sufficiently mature for this.
Rather than writing a book which would come out in three years, we thus decided
instead to provide an up-to-date edited volume, where we invited contributions by
many of the leading proponents of the field. To make it more than a mere collection
of articles, we have attempted to ensure that the chapters form a coherent whole
and use consistent notation. Moreover, we have written a short introduction, a
dialogue illustrating some of the ongoing debates in the underlying philosophy of
the field, and we have organized and summarized a comprehensive benchmark of
semi-supervised learning.
Benchmarks are helpful for the practitioner to decide which algorithm should be
chosen for a given application. At the same time, they are useful for researchers
to choose issues to study and further develop. By evaluating and comparing the
performance of many of the presented methods on a set of eight benchmark
problems, this book aims at providing guidance in this respect. The problems are
designed to reflect and probe the different assumptions that the algorithms build
on. All data sets can be downloaded from the book web page, which can be found
at http://www.kyb.tuebingen.mpg.de/ssl-book/.
Finally, we would like to give thanks to everybody who contributed towards the
success of this book project, in particular to Karin Bierig, Sabrina Nielebock, Bob
Prior, to all chapter authors, and to the chapter reviewers.
1 Introduction to Semi-Supervised Learning
Traditionally, there have been two fundamentally different types of tasks in machine
learning.
The first one is unsupervised learning. Let X = (x1, ..., xn) be a set of n examples
(or points), where xi ∈ X for all i ∈ [n] := {1, ..., n}. Typically it is assumed
that the points are drawn i.i.d. (independently and identically distributed) from
a common distribution on X. It is often convenient to define the (n × d)-matrix
X = (xi^⊤)^⊤_{i∈[n]} that contains the data points as its rows. The goal of unsupervised
learning is to find interesting structure in the data X. It has been argued that the
problem of unsupervised learning is fundamentally that of estimating a density
which is likely to have generated X. However, there are also weaker forms of
unsupervised learning, such as quantile estimation, clustering, outlier detection,
and dimensionality reduction.
The second task is supervised learning. The goal is to learn a mapping from
x to y, given a training set made of pairs (xi, yi). Here, the yi ∈ Y are called
the labels or targets of the examples xi. If the labels are numbers, y = (yi)^⊤_{i∈[n]}
denotes the column vector of labels. Again, a standard requirement is that the pairs
(xi , yi ) are sampled i.i.d. from some distribution which here ranges over X × Y.
The task is well defined, since a mapping can be evaluated through its predictive
performance on test examples. When Y = R or Y = Rd (or more generally, when the
labels are continuous), the task is called regression. Most of this book will focus on
classification (there is some work on regression in chapter 23), i.e., the case where
y takes values in a finite set (discrete labels). There are two families of algorithms
for supervised learning. Generative algorithms try to model the class-conditional
In fact, p(x|y)p(y) = p(x, y) is the joint density of the data, from which pairs
(xi, yi) could be generated. Discriminative algorithms do not try to estimate how
the xi have been generated, but instead concentrate on estimating p(y|x). Some
discriminative methods even limit themselves to modeling whether p(y|x) is greater
than or less than 0.5; an example of this is the support vector machine (SVM). It
has been argued that discriminative models are more directly aligned with the goal
of supervised learning and therefore tend to be more efficient in practice. These two
frameworks are discussed in more detail in sections 2.2.1 and 2.2.2.
1. For simplicity, we are assuming that all distributions have densities, and thus we restrict
ourselves to dealing with densities.
1.1 Supervised, Unsupervised, and Semi-Supervised Learning
output a prediction function which is defined on the entire space X. Many methods
described in this book will be transductive; in particular, this is rather natural for
inference based on graph representations of the data. This issue will be addressed
again in section 1.2.4.
Probably the earliest idea about using unlabeled data in classification is
self-learning, which is also known as self-training, self-labeling, or decision-directed
learning. This is a wrapper algorithm that repeatedly uses a supervised learning
method. It starts by training on the labeled data only. In each step a part of
the unlabeled points is labeled according to the current decision function; then
the supervised method is retrained using its own predictions as additional labeled
points. This idea has been in the literature for some time (e.g., Scudder
(1965); Fralick (1967); Agrawala (1970)).
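The wrapper described above can be sketched in a few lines. This is only an illustration under stated assumptions: the base learner is a nearest-centroid classifier, confidence is measured by the gap between the two smallest centroid distances, and the names `fit_centroids`, `predict_with_margin`, `self_train`, and the `batch` parameter are hypothetical choices, not taken from the cited papers.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Base supervised learner (assumed here): one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_with_margin(centroids, X):
    """Return predicted labels and a confidence score (gap between the
    distances to the two closest centroids)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    ordered = np.sort(d, axis=1)
    return d.argmin(axis=1), ordered[:, 1] - ordered[:, 0]

def self_train(X_l, y_l, X_u, n_classes=2, batch=5):
    """Self-learning wrapper: train, label part of the unlabeled pool
    with the current decision function, retrain, and repeat."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    centroids = fit_centroids(X_l, y_l, n_classes)   # labeled data only
    while len(X_u) > 0:
        labels, conf = predict_with_margin(centroids, X_u)
        pick = np.argsort(-conf)[:batch]             # most confident points
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, labels[pick]])    # own predictions as labels
        X_u = np.delete(X_u, pick, axis=0)
        centroids = fit_centroids(X_l, y_l, n_classes)
    return centroids
```

Any supervised method with a confidence score could replace the nearest-centroid learner; the behavior of the wrapper then changes accordingly.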
An unsatisfactory aspect of self-learning is that the effect of the wrapper depends
on the supervised method used inside it. If self-learning is used with empirical risk
minimization and 0-1 loss, the unlabeled data will have no effect on the solution
at all. If instead a margin-maximizing method is used, the decision
boundary is pushed away from the unlabeled points (cf. chapter 6). In other cases
it is unclear what self-learning is really doing, and which assumption
it corresponds to.
Closely related to semi-supervised learning is the concept of transductive
inference, or transduction, pioneered by Vapnik (Vapnik and Chervonenkis, 1974;
Vapnik and Sterin, 1977). In contrast to inductive inference, no general decision rule
is inferred, but only the labels of the unlabeled (or test) points are predicted. An
early instance of transduction (albeit without explicitly considering it as a concept)
was already proposed by Hartley and Rao (1968). They suggested a combinatorial
optimization on the labels of the test points in order to maximize the likelihood of
their model.
It seems that semi-supervised learning really took off in the 1970s when the
problem of estimating the Fisher linear discriminant rule with unlabeled data
was considered (Hosmer, 1973; McLachlan, 1977; O’Neill, 1978; McLachlan and
Ganesalingam, 1982). More precisely, the setting was one in which each
class-conditional density is Gaussian with the same covariance matrix. The likelihood of
the model is then maximized using the labeled and unlabeled data with the help
of an iterative algorithm such as the expectation-maximization (EM) algorithm
(Dempster et al., 1977). Instead of a mixture of Gaussians, the use of a mixture
of multinomial distributions estimated with labeled and unlabeled data has been
investigated in (Cooper and Freeman, 1970).
Later, this one-component-per-class setting was extended to several components
per class (Shahshahani and Landgrebe, 1994) and further generalized by
Miller and Uyar (1997).
Learning rates in a probably approximately correct (PAC) framework (Valiant,
1984) have been derived for the semi-supervised learning of a mixture of two
Gaussians by Ratsaby and Venkatesh (1995). In the case of an identifiable mixture,
Castelli and Cover (1995) showed that with an infinite number of unlabeled points,
the probability of error converges exponentially (w.r.t. the number of labeled
examples) to the Bayes risk. Identifiable means that given P(x), the decomposition
into Σ_y P(y)P(x|y) is unique. This seems a relatively strong assumption, but it is
satisfied, for instance, by mixtures of Gaussians. Related is the analysis in (Castelli
and Cover, 1996) in which the class-conditional densities are known but the class
priors are not.
Finally, the interest in semi-supervised learning increased in the 1990s, mostly
due to applications in natural language problems and text classification (Yarowsky,
1995; Nigam et al., 1998; Blum and Mitchell, 1998; Collins and Singer, 1999;
Joachims, 1999).
Note that, to our knowledge, Merz et al. (1992) were the first to use the term
“semi-supervised” for classification with both labeled and unlabeled data. It has
in fact been used before, but in a different context than what is developed in this
book; see, for instance, (Board and Pitt, 1989).
Suppose we knew that the points of each class tended to form a cluster. Then the
unlabeled data could aid in finding the boundary of each cluster more accurately:
one could run a clustering algorithm and use the labeled points to assign a class
to each cluster. That is in fact one of the earliest forms of semi-supervised learning
(see chapter 2). The underlying, now classical, assumption may be stated as follows:
Cluster assumption: If points are in the same cluster, they are likely to be of the
same class.
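The cluster-then-label idea mentioned above (run a clustering algorithm, then use the labeled points to assign a class to each cluster) might look as follows. This is a sketch under assumptions: the clustering algorithm is plain k-means with a farthest-first initialization, each cluster takes the majority class of its labeled members, and the function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Lloyd iterations with farthest-first initialization (an assumed choice)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():          # keep the old center if a cluster is empty
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def cluster_then_label(X, labeled_idx, labels, k):
    """Cluster all points, then give each cluster the majority class
    of the labeled points that fall into it."""
    assign = kmeans(X, k)
    pred = np.full(len(X), -1)               # -1 marks clusters without labeled points
    for j in range(k):
        members = labels[assign[labeled_idx] == j]
        if len(members) > 0:
            pred[assign == j] = np.bincount(members).argmax()
    return pred
```

Here `labels` holds non-negative integer classes for the points indexed by `labeled_idx`; all other points are treated as unlabeled.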
This assumption may be considered reasonable on the basis of the sheer existence
2. Strictly speaking, this assumption only refers to continuity rather than smoothness;
however, the term smoothness is commonly used, possibly because in regression estimation
y is often modeled in practice as a smooth function of x.
A different but related assumption that forms the basis of several semi-supervised
learning methods is the manifold assumption:
Manifold assumption: The (high-dimensional) data lie (roughly) on a low-dimensional
manifold.
How can this be useful? A well-known problem of many statistical methods and
learning algorithms is the so-called curse of dimensionality (cf. section 11.6.2). It is
related to the fact that volume grows exponentially with the number of dimensions,
and an exponentially growing number of examples is required for statistical tasks
such as the reliable estimation of densities. This is a problem that directly affects
generative approaches that are based on density estimates in input space. A related
problem of high dimensions, which may be more severe for discriminative methods,
is that pairwise distances tend to become more similar, and thus less expressive.
If the data happen to lie on a low-dimensional manifold, however, then the
learning algorithm can essentially operate in a space of corresponding dimension,
thus avoiding the curse of dimensionality.
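The remark that pairwise distances become more similar in high dimensions can be checked numerically. The sketch below assumes points drawn uniformly from the unit cube and uses a hypothetical helper, `relative_contrast`, that reports (max − min) / min over all pairwise Euclidean distances; a small value means the distances are nearly indistinguishable.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of
    points sampled uniformly from the unit cube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))
    sq = (X ** 2).sum(axis=1)
    # squared distances via the Gram matrix, to avoid an n x n x dim array
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    iu = np.triu_indices(n_points, k=1)      # each unordered pair once
    d = np.sqrt(np.maximum(d2[iu], 0.0))
    return (d.max() - d.min()) / d.min()

if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, relative_contrast(200, dim))
```

The printed contrast should shrink as the dimension grows, which is the sense in which distances become "less expressive".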
As above, one can argue that algorithms working with manifolds may be seen
1.2 When Can Semi-Supervised Learning Work?
1.2.4 Transduction
Although many methods were not explicitly derived from one of the above assump-
tions, most algorithms can be seen to correspond to or implement one or more
of them. We try to organize the semi-supervised learning methods presented in
this book into four classes that roughly correspond to the underlying assumption.
Although the classification is not always unique, we hope that this organization
makes the book and its contents more accessible to the reader, by providing a
guiding scheme.
For the same reason, this book is organized in “parts.” There is one part for each
class of SSL algorithms and an extra part focusing on generative approaches. Two
further parts are devoted to applications and perspectives of SSL. In the following
we briefly introduce the ideas covered by each book part.
Part I presents history and state of the art of SSL with generative models. Chapter 2
starts with a thorough review of the field.
Inference using a generative model involves the estimation of the conditional
density p(x|y). In this setting, any additional information on p(x) is useful. As
a simple example, assume that p(x|y) is Gaussian. Then one can use the EM
algorithm to find the parameters of the Gaussian corresponding to each class. The
only difference from the standard EM algorithm as used for clustering is that the
“hidden variable” associated with any labeled example is actually not hidden, but
is known and equals its class label. This approach implements the cluster assumption (cf.
section 2.2.1), since a given cluster belongs to only one class.
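A minimal sketch of this modified EM procedure follows, assuming one-dimensional data, one Gaussian component per class, and at least one labeled point per class. The function names and the initialization from the labeled points are illustrative assumptions. The only change from clustering EM is that the responsibilities of the labeled points are fixed to their one-hot class labels.

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def ssl_em(x_l, y_l, x_u, n_classes=2, n_iter=50):
    """EM for a 1-D Gaussian mixture with one component per class,
    using labeled (x_l, y_l) and unlabeled x_u points."""
    x = np.concatenate([x_l, x_u])
    R_l = np.eye(n_classes)[y_l]             # labeled: "hidden" variable is known
    pi = R_l.mean(axis=0)                    # initialize from labeled points only
    mu = (R_l * x_l[:, None]).sum(axis=0) / R_l.sum(axis=0)
    var = np.full(n_classes, x.var())
    for _ in range(n_iter):
        # E-step: class posteriors are computed for the unlabeled points only
        p = pi * gauss(x_u[:, None], mu, var)
        R_u = p / p.sum(axis=1, keepdims=True)
        R = np.vstack([R_l, R_u])
        # M-step: weighted maximum likelihood over all points
        Nk = R.sum(axis=0)
        pi = Nk / len(x)
        mu = (R * x[:, None]).sum(axis=0) / Nk
        var = (R * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```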
This small example already highlights different interpretations of semi-supervised
learning with a generative model:
choice of this prior can be made more precise after seeing the unlabeled data: one
could typically put a higher prior probability on functions that satisfy the cluster
assumption. From a theoretical point of view, this is a natural way to obtain bounds for
semi-supervised learning, as explained in chapter 22.
Part II of this book aims at describing algorithms which try to directly implement
the low-density separation assumption by pushing the decision boundary away from
the unlabeled points.
The most common approach to achieving this goal is to use a maximum margin
algorithm such as support vector machines. The method of maximizing the margin
for unlabeled as well as labeled points is called the transductive SVM (TSVM).
However, the corresponding problem is nonconvex and thus difficult to optimize.
One optimization algorithm for the TSVM is presented in chapter 6. Starting
from the SVM solution as trained on the labeled data only, the unlabeled points are
labeled by SVM predictions, and the SVM is retrained on all points. This is iterated
while the weight of the unlabeled points is slowly increased. Another possibility is
the semidefinite programming (SDP) relaxation suggested in chapter 7.
Two alternatives to the TSVM are then presented that are formulated in a
probabilistic and in an information theoretic framework, respectively. In chapter
8, binary Gaussian process classification is augmented by the introduction of a null
class that occupies the space between the two regular classes. As an advantage over
the TSVM, this allows for probabilistic outputs.
This advantage is shared by the entropy minimization presented in chapter 9. It
encourages the class-conditional probabilities P (y|x) to be close to either 1 or 0 at
labeled and unlabeled points. As a consequence of the smoothness assumption, the
probability will tend to be close to 0 or 1 throughout any high-density region, while
class boundaries correspond to intermediate probabilities.
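To illustrate the idea, the sketch below adds the mean prediction entropy on the unlabeled points to a labeled negative log-likelihood. The precise objective of chapter 9 is not given in this section, so everything here is an assumption made for illustration: a binary logistic model with weight vector `w`, the trade-off parameter `lam`, and the function names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) prediction; zero when p is 0 or 1."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def entropy_regularized_loss(w, X_l, y_l, X_u, lam=1.0):
    """Labeled negative log-likelihood plus lam times the mean entropy
    of the predictions on the unlabeled points (logistic model assumed)."""
    p_l = np.clip(sigmoid(X_l @ w), 1e-12, 1 - 1e-12)
    nll = -np.mean(y_l * np.log(p_l) + (1 - y_l) * np.log(1 - p_l))
    return nll + lam * np.mean(binary_entropy(sigmoid(X_u @ w)))
```

A decision boundary that cuts through a dense group of unlabeled points yields predictions near 0.5 there and is therefore penalized, whereas a boundary in a low-density gap is not.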
A different way of using entropy or information is the data-dependent regulariza-
tion developed in chapter 10. As compared to the TSVM, this seems to implement
the low-density separation even more directly: the standard squared-norm regular-
izer is multiplied by a term reflecting the density close to the decision boundary.
During the last couple of years, the most active area of research in semi-supervised
learning has been in graph-based methods, which are the topic of part III of this
book. The common denominator of these methods is that the data are represented
by the nodes of a graph, the edges of which are labeled with the pairwise distances
of the incident nodes (and a missing edge corresponds to infinite distance). If the
distance of two points is computed by minimizing the aggregate path distance over
all paths connecting the two points, this can be seen as an approximation of the
geodesic distance of the two points with respect to the manifold of data points.
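The shortest-path approximation of geodesic distance can be sketched as follows. The construction is an assumption for illustration: a symmetric k-nearest-neighbor graph with Euclidean edge lengths, missing edges set to infinity, and Floyd-Warshall relaxation for the all-pairs shortest paths. The name `graph_geodesics` is hypothetical, and the code assumes there are no duplicate points.

```python
import numpy as np

def graph_geodesics(X, k=2):
    """All-pairs shortest-path distances on a symmetric k-NN graph."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.full((n, n), np.inf)              # missing edge = infinite distance
    nn = np.argsort(d, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    for i in range(n):
        G[i, nn[i]] = d[i, nn[i]]
        G[nn[i], i] = d[i, nn[i]]            # symmetrize the edge set
    np.fill_diagonal(G, 0.0)
    for m in range(n):                       # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G
```

For points sampled along a curve, the entry `G[i, j]` approximates the arc length between the two points, which can be much larger than their straight-line distance.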
\[
\mathcal{L} := I - D^{-1/2} W D^{-1/2}, \qquad L := D - W. \tag{1.3}
\]
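The two definitions in (1.3), the normalized and the unnormalized graph Laplacian, can be computed directly from a weight matrix. The sketch assumes W is a symmetric, non-negative weighted adjacency matrix, D is the diagonal matrix of its row sums, and every node has positive degree; the function name is hypothetical.

```python
import numpy as np

def graph_laplacians(W):
    """Return (normalized, unnormalized) graph Laplacians of a symmetric
    weight matrix W. Assumes every node has positive degree."""
    deg = W.sum(axis=1)
    D = np.diag(deg)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_norm = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    L_unnorm = D - W
    return L_norm, L_unnorm
```

For a vector f of node values, the quadratic form f^T (D − W) f equals the weighted sum of squared differences (f_i − f_j)^2 over the edges, so it is small exactly when f varies little along heavily weighted edges.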
Many graph methods that penalize nonsmoothness along the edges of a weighted
graph can in retrospect be seen as different instances of a rather general family of
algorithms, as is outlined in chapter 11. Chapter 13 takes a more theoretical point
of view, and transfers notions of smoothness from the continuous case onto graphs
as the discrete case. From that, it proposes different regularizers based on a graph
representation of the data.
Usually the prediction consists of labels for the unlabeled nodes. For this reason,
this kind of algorithm is intrinsically transductive, i.e., it returns only the value of
the decision function on the unlabeled points and not the decision function itself.
However, recent work has sought to extend graph-based methods to
produce inductive solutions, as discussed in chapter 12.
Information propagation on graphs can also serve to improve a given (possibly
strictly supervised) classification, taking unlabeled data into account. Chapter 14
presents a probabilistic method for using directed graphs in this manner.
Often the graph g is constructed by computing similarities of objects in some
other representation, e.g., using a kernel function on Euclidean data points. But
sometimes the original data already have the form of a graph. Examples include
the linkage pattern of webpages and the interactions of proteins (see chapter 20).
In such cases, the directionality of the edges may be important.
The topic of part IV is algorithms that are not intrinsically semi-supervised, but
instead perform two-step learning:
1. Perform an unsupervised step on all data, labeled and unlabeled, but ignoring
the available labels. This can, for instance, be a change of representation, or the
1.3 Classes of Algorithms and Organization of This Book
Semi-supervised learning will be most useful whenever there are far more unlabeled
data than labeled. This is likely to occur if obtaining data points is cheap, but
obtaining the labels costs a lot of time, effort, or money. This is the case in many
application areas of machine learning, for example:
In speech recognition, it costs almost nothing to record huge amounts of speech,
but labeling it requires some human to listen to it and type a transcript.
Billions of webpages are directly available for automated processing, but to
classify them reliably, humans have to read them.
Protein sequences are nowadays acquired at industrial speed (by genome sequenc-
ing, computational gene finding, and automatic translation), but to resolve a three-
dimensional (3D) structure or to determine the functions of a single protein may
require years of scientific work.
Webpage classification is introduced in chapter 3 in the context of generative
models.
Since unlabeled data carry less information than labeled data, they are required
in large amounts in order to increase prediction accuracy significantly. This implies
the need for fast and efficient SSL algorithms. Chapters 18 and 19 present two
approaches to dealing with huge numbers of points. In chapter 18 methods are
developed for speeding up the label propagation methods introduced in chapter 11.
In chapter 19 cluster kernels are shown to be an efficient SSL method.
Chapter 19 also presents the first of two approaches to an important bioinformat-
ics application of semi-supervised learning: the classification of protein sequences.
While here the predictions are based on the protein sequences themselves, Chap-
ter 20 moves on to a somewhat more complex setting: The information is here
assumed to be present in the form of graphs that characterize the interactions of
proteins. Several such graphs exist and have to be combined in an appropriate way.
This book part concludes with a very practical chapter: the presentation and
evaluation of the benchmarks associated with this book (chapter 21). It is intended
to give hints to the practitioner on how to choose suitable methods based on the
properties of the problem.
1.3.6 Outlook
The last part of the book, part VI, is devoted to some of the most interesting
directions of ongoing research in SSL.
Until now this book has mostly restricted itself to classification. Chapter 23
introduces another approach to SSL that is suited for both classification and
regression, and derives algorithms from it. Interestingly it seems not to require
the assumptions proposed in chapter 1.
Further, this book mostly presented algorithms for SSL. While the assumptions
discussed above supply some intuition on when and why SSL works, and chapter 4
investigates when and why it can fail, it would clearly be more satisfactory to have
a thorough theoretical understanding of SSL in total. Chapter 22 offers a PAC-style
framework that yields error bounds for SSL problems.
In chapter 24 inductive semi-supervised learning and transduction are compared
in terms of Vapnik-Chervonenkis (VC) bounds and other theoretical and philosoph-
ical concepts.
The book closes with a hypothetical discussion (chapter 25) between three
machine learning researchers on the relationship of (and the differences between)
semi-supervised learning and transduction.
I Generative Models
2 A Taxonomy for Semi-Supervised Learning
Methods
The semi-supervised learning (SSL) problem has recently attracted considerable attention
in the machine learning community, mainly due to its significant importance in
practical applications. In this section we define the problem and introduce the
notation to be used in the rest of this chapter.
In statistical machine learning, we distinguish between unsupervised and super-
vised learning. In the former scenario we are given a sample {xi } of patterns in X
drawn independently and identically distributed (i.i.d.) from some unknown data
distribution with density P (x). Our goal is to estimate the density or a (known)
functional thereof. Supervised learning consists of estimating a functional relation-
ship x → y between a covariate x ∈ X and a class variable1 y ∈ {1, . . . , M }, with
the goal of minimizing a functional of the (joint) data distribution P (x, y) such
as the probability of classification error. The marginal data distribution P (x) is
referred to as input distribution. Classification can be treated as a special case of
estimating the joint density P (x, y), but this is wasteful since x will always be
given at prediction time, so there is no need to estimate the input distribution.
The terminology “unsupervised learning” is a bit unfortunate: the term density
2. While this statement is probably open to debate, it is in fact agreed upon in statistics.
In our opinion, methods should be classified foremost according to the problem they try to
solve, not by which sources of data they make use of. On the other hand, there are problems
in which density estimation is the goal and labeled data are treated as an auxiliary source.
However, these fall into a category with very different characteristics and are not in the
scope of this chapter. In our opinion, it would be very confusing to lump them together
with methods we classify as SSL here. A label like “semi-unsupervised learning” would be
more appropriate.
3. This is a “no free lunch” statement for SSL, but in practice it seems to be a more
serious problem than in the purely supervised context (where a “no free lunch” statement
holds as well). See chapter 4 for some examples.
2.2 Paradigms for Semi-Supervised Learning
Since SSL methods are supervised learning techniques, they can be classified
according to the standard taxonomy into generative and diagnostic paradigms. In
this section we present these paradigms and highlight their differences in the case
of SSL. We also note that this taxonomy, which originated for purely supervised
methods, can be ambiguous when applied to SSL, and we suggest how the borderline
can be drawn exactly.
In the figures of this section, we employ a convenient graphical notation frequently
used in statistics and machine learning (Lauritzen, 1996; Jordan, 1999). These so-
called directed graphical models (or independence diagrams) have the following
intuitive semantics. Nodes represent random variables. The parents of a node i are
the nodes j for which a directed edge j → i exists.4 It is possible to sample the
value of a node once the values of all its parents are known. Thus, a graphical model
is a simple way of representing the sampling mechanism from a distribution over
several variables. As such, the graphical model encodes conditional independency
constraints that have to hold for the distribution. In order to sample from the
distribution, we start with nodes without parents and work in the directions of the
edges. We also make use of plates which are rectangular boxes grouping a set of
nodes. This means that the group is sampled repeatedly and independently from
the same distribution (i.i.d.) conditioned on all nodes which are parents of any
plate member. For example, the figure of section 2.2.1 means that we first sample
θ and π independently (neither has parents), then draw a sample {(xi , yi )} i.i.d.
conditioned on θ, π (which are parents of the plate).
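The sampling semantics of the example can be written out as ancestral sampling: root nodes first, then the plate. The concrete distributions below are assumptions chosen only to make the sketch runnable (a Dirichlet prior on π, Gaussian priors on per-class means θ, unit-variance Gaussian class-conditionals), and the within-plate order (y_i from π, then x_i from y_i and θ) follows the generative factorization π_y P(x|y, θ).

```python
import numpy as np

def sample_generative_model(n, n_classes=3, seed=0):
    """Ancestral sampling: parents (theta, pi) first, then the plate
    of n i.i.d. pairs (x_i, y_i) conditioned on them."""
    rng = np.random.default_rng(seed)
    # Root nodes have no parents and are sampled independently.
    pi = rng.dirichlet(np.ones(n_classes))          # class priors (assumed prior)
    theta = rng.normal(0.0, 5.0, size=n_classes)    # class means (assumed prior)
    # Plate: repeated, independent sampling given the parents.
    y = rng.choice(n_classes, size=n, p=pi)         # y_i given pi
    x = rng.normal(theta[y], 1.0)                   # x_i given y_i and theta
    return pi, theta, x, y
```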
Note that we describe the generative and diagnostic paradigm from an explicitly
Bayesian viewpoint. This is somewhat a matter of personal choice here, and
certainly one could sketch these classes without ever mentioning concepts like prior
distributions. On the other hand, the Bayesian view avoids many unnecessary
complications, in that all variables are random, no difference has to be made
between functional and probabilistic independence, and so on, so we do not think
our presentation lacks clarity or generality because of this choice.
4. Directed cycles are not allowed. In other words, it must be impossible to return to a
node by moving along edges and respecting their direction.
If labeled and unlabeled data are available, a natural criterion emerges as the joint
log likelihood of both Dl and Du ,
\[
\sum_{i=1}^{n} \log \pi_{y_i} P(x_i \mid y_i, \theta)
  + \sum_{i=n+1}^{n+m} \log \sum_{y=1}^{M} \pi_y P(x_i \mid y, \theta), \tag{2.1}
\]
or alternatively the posterior P (θ, π|Dl , Du ).6 This is essentially an issue of max-
imum likelihood in the presence of missing data (treating y as a latent variable),
which can in principle be attacked by the expectation-maximization (EM) algorithm
(see section 2.3.1) or by direct gradient descent.
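As an illustration, criterion (2.1) can be evaluated numerically once a model family is fixed. The sketch assumes one-dimensional Gaussian class-conditional densities with per-class mean and variance, which is a choice made here and not prescribed by the text; labels are taken to be 0, ..., M−1, and the function name is hypothetical.

```python
import numpy as np

def joint_log_likelihood(x_l, y_l, x_u, pi, mu, var):
    """Criterion (2.1) for 1-D Gaussian class-conditional densities."""
    def log_gauss(x, m, v):
        return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v
    # Labeled term: sum_i log pi_{y_i} P(x_i | y_i, theta)
    labeled = np.sum(np.log(pi[y_l]) + log_gauss(x_l, mu[y_l], var[y_l]))
    # Unlabeled term: sum_i log sum_y pi_y P(x_i | y, theta),
    # computed with log-sum-exp for numerical stability
    a = np.log(pi) + log_gauss(x_u[:, None], mu, var)
    a_max = a.max(axis=1, keepdims=True)
    unlabeled = np.sum(a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1)))
    return labeled + unlabeled
```

When m ≫ n, the unlabeled term contains far more summands than the labeled one, which is one way to see the balancing problem between Dl and Du.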
Some researchers have been quick in hailing this strategy as an obvious solution
to the SSL problem, but this is not the case, in about the same sense as generative
methods often do not provide good solutions to classification problems. Generative
techniques provide an estimate of P (x) along the way, although this is not required
for classification, and in general this proves wasteful given limited data. For ex-
ample, maximizing the joint likelihood of a finite sample need not lead to a small
classification error, because depending on the model it may be possible to increase
the likelihood more by improving the fit of P (x) than the fit of P (y|x). This is an
instance of the general problem of balancing the impact of Dl and Du on the final
predictions, especially in the case m ≫ n. This issue is discussed in section 2.3.1.
Furthermore, in the SSL setting y is a latent variable which has to be summed
out on Du , leading to highly multimodal posteriors, so that likelihood or posterior
maximization techniques are plagued by the presence of very many (local) minima.
which implies that P (θ|Dl , Du ) ∝ P (Yl |X l , θ)P (θ), i.e. P (θ|Dl , Du ) = P (θ|Dl ),
and θ and µ are a posteriori independent. Furthermore, P (θ|Dl , µ) = P (θ|Dl ).
This means that neither knowledge of the unlabeled data Du nor any knowledge
of µ changes the posterior belief P (θ|Dl ) of the labeled sample. Therefore, in the
standard data generation model for diagnostic methods, unlabeled data Du cannot
be used for Bayesian inference, and modeling the input distribution P (x) is not
necessary. There are non-Bayesian diagnostic techniques in which we can make use
of Du (see section 2.3.2), but the impact of doing so (as opposed to ignoring Du ) is
usually very limited. In order to make significant use of unlabeled data in diagnostic
methods, the data generation model discussed above has to be modified as discussed
in the following section.
When learning from a sample Dl of limited size, typically very many associations
x → y are consistent with the data. The idea of regularization is to bias our choice
of classifier toward “simpler” hypotheses, by adding a regularization functional
to the criterion to be minimized which grows with complexity. Here, the notion of
simplicity depends on the task and the model setup. For example, for a linear model
it is customary to penalize a norm of the weight vector, and for some commonly
used regularization functionals this can be shown to be equivalent to placing a
zero-mean prior distribution on the weight vector. From now on we will only be
interested in regularization by priors and will use the terms interchangeably.
We have seen in section 2.2.2 that with straight diagnostic Bayesian methods
for classification, we cannot make use of additional unlabeled data Du , because θ
(parameterizing P (y|x)) and µ (parameterizing P (x)) are a priori independent. In
other words, the model family {P (y|x, θ)} is regularized independently of the input
distribution.
If we allow prior dependencies between θ and µ, e.g., P (θ, µ) = P (θ|µ)P (µ) and
P (θ) = ∫ P (θ|µ)P (µ) dµ (as shown in the accompanying independence diagram), the
situation is different. The conditional prior P (θ|µ) in principle allows information
about µ to be transferred to θ. In general, θ and Du will be dependent given the
labeled data Dl , therefore unlabeled data can change our posterior belief in θ.
[Figure: independence diagram over the variables µ, θ, x, and y.]
We conclude that to make use of additional unlabeled data within the context
of diagnostic Bayesian supervised techniques, we have to allow an a priori depen-
dence between the latent function representing the conditional probability and the
input probability itself. In other words, we have to use a regularization of the latent
function which depends on the input distribution. The potential gain can be
demonstrated by the following argument. Note that conditional priors imply a marginal
prior P (θ) which is a mixture distribution: P (θ) = ∫ P (θ|µ)P (µ) dµ. By
conditioning on the unlabeled data, this is replaced by P (θ|Du ) = ∫ P (θ|µ)P (µ|Du ) dµ,
which can have a much smaller entropy than P (θ), implying that the posterior
belief P (θ|Dl , Du ) can be much narrower than P (θ|Dl ). On the other hand, the same
argument can be used to demonstrate that using additional unlabeled data Du can
hurt instead of help. Namely, if the priors P (θ|µ) enforce certain constraints very
rigidly, but these happen to be violated in the true distribution P (x, y), the con-
ditional “prior” P (θ|Du ) will assign much lower probability than P (θ) to models
P (y|x, θ) close to the truth, and the posterior P (θ|Dl , Du ) can be concentrated
around suboptimal models. While it is certainly easy to construct artificial situ-
ations where additional unlabeled data hurt, it is worrying that such failures do
happen quite unexpectedly in practically relevant settings as well. For a more thor-
ough analysis of this problem, see Cozman and Cohen (chapter 4 in this volume).
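The entropy argument, and its flip side, can be checked numerically. The following is a minimal sketch, assuming a purely hypothetical discrete setting in which µ takes three values and θ takes four; the tables `cond_prior`, `prior_mu`, and `post_mu` and the index `true_theta` are invented for illustration and do not come from the text.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

# Hypothetical tables: row j holds the conditional prior P(theta | mu = j).
cond_prior = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
])
prior_mu = np.array([1 / 3, 1 / 3, 1 / 3])   # P(mu)
post_mu = np.array([0.96, 0.02, 0.02])       # assumed P(mu | Du)

# Mixture over mu: P(theta) and P(theta | Du).
marginal_prior = prior_mu @ cond_prior
conditional_prior = post_mu @ cond_prior

h_before = entropy(marginal_prior)
h_after = entropy(conditional_prior)
print("entropy of P(theta):      ", round(h_before, 3))
print("entropy of P(theta | Du): ", round(h_after, 3))

# The same mechanism can hurt: if the truth is a theta that the rigid
# conditional priors disfavour for the mu supported by Du, it loses mass.
true_theta = 1  # hypothetical index of the true model
print("mass on true theta before:", round(marginal_prior[true_theta], 3))
print("mass on true theta after: ", round(conditional_prior[true_theta], 3))
```

In this toy setting conditioning on Du sharpens the prior over θ, but it also moves mass away from `true_theta`, which is the failure mode described above.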
We note that while the modification to the standard data generation model
for diagnostic methods suggested here is straightforward, choosing appropriate
conditional priors P (θ|µ) suitable for a task at hand can be challenging. However,
several general techniques for SSL can actually be seen as realizing input-dependent
regularization, as is demonstrated in section 2.3.3.
The reader may feel uneasy at this point. If we use a priori dependent θ and µ, the
final predictive distribution depends on the prior P (µ) over the input distribution.
This forces us to model the input distribution itself, in contrast to the situation
for standard diagnostic methods. In this case, will our method still be a diagnostic
one? Is it not the case that any method which models P (x) in some way must
imply the desired properties of P (y|x) and P (x). In contrast to that, in a diagnostic
method we model P (y|x) directly, and also typically have considerable freedom in
modeling P (x). In SSL we regularize the P (y|x) estimates using information from
P (x), but we do not have to specify the class-conditional distributions explicitly.7
While this definition is workable for the SSL methods mentioned here, it may be too
restrictive on the generative side. For example, the “many-centers-per-class” model
of section 2.3.1 is clearly generative, but works with a mixture model for P (x)
which has several components for each class y, and P (x|y) is modeled indirectly
via P (x|y) = Σk πy βy,k P (x|k), i.e., as a mixture itself. In the following paragraph
we suggest an alternative view which leaves more freedom for generative techniques.
The practical success of SSL has shown that unlabeled data, i.e., knowledge about
P (x), can be useful for supervised tasks, but it is not necessarily the same type
of knowledge that would lead to a good estimate of P (x) according to common
performance criteria for density estimation. In fact, it is actually a few general
characteristics of P (x) which seem to help classification (see, e.g., section 2.3.3.1).
For example, if we convert a purely diagnostic technique such as SVM or logistic
regression into an SSL technique by employing a regularizer penalizing P (y|x)
estimates which violate certain aspects of P (x) such as the cluster assumption (see
section 2.3.3.1), the influence of P (x) on the final P (y|x) estimate is restricted
to just these aspects that we hope are important for better classification. These
restrictions are engineered by us because we want to make best use of Du in order to
predict P (y|x). In contrast, if we perform SSL by maximizing a suitably reweighted
version of the joint log likelihood (2.1) of a mixture model (see section 2.3.1), such
a restriction to classification-relevant aspects is not given or at least not directly
planned. In fact the joint model is designed in much the same way as we would do
for density estimation.
For example, consider the framework of conditional priors of section 2.2.3. While
it is essential to learn about P (x) in SSL, the impact of an oversimple model for
P (x) on the final prediction is much less severe than in density estimation. This is
because a suitable regularization will only depend on certain aspects of P (x) (e.g.,
on the coarse locations of high-density regions under the cluster assumption; see
section 2.3.3.1), and our model for the x distribution only has to be able to capture
those accurately.
7. There are, of course, class-conditional distributions which are implied by the models of
P (y|x) and P (x) (use the Bayes formula), but importantly we do not have to work with
them directly, so that their form is not restricted by tractability requirements.
2.3 Examples
In this section we provide examples of SSL methods falling in each of the categories
introduced in the previous section. We do not try to provide a comprehensive
literature review here (see (Seeger, 2000b) for a review of work up to about 2001),
but are selective in order to point out characteristics of and differences between the
categories. Note that in this context (and also in (Seeger, 2000b)) some methods are
classified as “baseline methods.” This does not constitute a devaluation, and in fact
some of these methods belong to the top performers on some tasks. Furthermore, we
think that theoretical analyses of such methods are of great value, not least because
many practitioners use them. Our label applies to methods which can be derived
fairly straightforwardly from standard unsupervised or supervised methods, and
we hope that truly novel proposals are in fact compared against the most closely
related baseline methods.
Recall from section 2.2.1 that generative techniques use a model family {P (x, y|θ, π)}
in order to model the joint data distribution P (x, y). The simplest idea is to run a
mixture density estimation method for P (x) on X l ∪X u , treating y as a latent class
variable, then using the labeled sample Dl in order to associate latent classes with
actual ones. An obvious problem with this approach is that the labeling provided by
the unsupervised method may be inconsistent with Dl , in which case the clustering
should be modified to achieve consistency with Dl . Castelli and Cover (Castelli
and Cover, 1995) provide a simple analysis of this baseline method under fairly
unrealistic identifiability conditions. Namely, they assume that the data distribu-
tion is exactly identifiable by the unsupervised method at hand, which employs a
mixture model with one component for each class. It is not clear how to achieve
this in practice, even if P (x) is exactly known.8 In the large-sample limit, all class
distributions can be learned perfectly, but the assignment of classes to label names
obviously remains completely open. However, only a few additional labeled points
are required in order to learn this assignment. In fact, it is easy to see that the
error rate converges to the Bayes error exponentially fast (in the number of labeled
examples drawn from P (x, y)).
Another baseline method consists of maximizing the joint likelihood of Eq. 2.1.
For m > 0, the criterion to be maximized is not concave and is typically multimodal,
so we have to content ourselves with finding a local maximum. This can be
done by direct gradient ascent or more conveniently by applying the
expectation-maximization (EM) algorithm (Dempster et al., 1977). The latter is an iterative
procedure which is guaranteed to converge to a local maximum of the likelihood.
If all data in Eq. 2.1 were labeled, a local maximum would be found by a single
optimization over θ. In fact, if the class-conditional distributions P (x|y, θ) are from
E and M steps are iterated until convergence. It is easy to show that φ is a lower
bound on the joint log likelihood (2.1) for any choice of q on the unlabeled points.
The bound becomes an equality if the q are chosen as posteriors and the parameters
θ, π are not changed. Furthermore, under this choice the gradients of the lower bound
and joint log likelihood are the same at θ, π, so that if EM converges we have found
a local maximum of Eq. 2.1.
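The E and M steps described above can be sketched for a very small model. The following is an illustrative sketch only, assuming one univariate Gaussian component per class with a known shared variance, and assuming the joint log likelihood has the usual form (labeled terms log πy P (x|y), unlabeled terms log Σy πy P (x|y)); the function names and the synthetic data are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def joint_log_lik(x_l, y_l, x_u, pi, means, var):
    """Labeled terms log pi_y P(x|y) plus unlabeled terms log sum_y pi_y P(x|y)."""
    labeled = np.log(pi[y_l] * gauss_pdf(x_l, means[y_l], var))
    mix = (pi * gauss_pdf(x_u[:, None], means[None, :], var)).sum(axis=1)
    return float(labeled.sum() + np.log(mix).sum())

def semi_supervised_em(x_l, y_l, x_u, n_classes=2, var=1.0, n_iter=30):
    onehot = np.eye(n_classes)[y_l]  # labels act as fixed "q" on labeled points
    # Initialise from the labeled data alone (every class must occur in y_l).
    means = onehot.T @ x_l / onehot.sum(axis=0)
    pi = onehot.mean(axis=0)
    history = []
    for _ in range(n_iter):
        # E step: q on unlabeled points = class posteriors under current parameters.
        q = pi * gauss_pdf(x_u[:, None], means[None, :], var)
        q /= q.sum(axis=1, keepdims=True)
        # M step: closed-form maximiser of the lower bound for fixed q.
        counts = onehot.sum(axis=0) + q.sum(axis=0)
        means = (onehot.T @ x_l + q.T @ x_u) / counts
        pi = counts / counts.sum()
        history.append(joint_log_lik(x_l, y_l, x_u, pi, means, var))
    return pi, means, history

# Synthetic data (assumed): four labeled points, two hundred unlabeled points.
rng = np.random.default_rng(0)
x_l = np.array([-2.1, -1.7, 2.3, 1.9])
y_l = np.array([0, 0, 1, 1])
x_u = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
pi, means, history = semi_supervised_em(x_l, y_l, x_u)
print("mixing weights:", pi, "means:", means)
```

Because each iteration first makes the bound tight (E step) and then maximizes it (M step), the recorded log-likelihood values in `history` should never decrease, which is a convenient sanity check for any EM implementation.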
The idea of using EM on a joint generative model to train on labeled and
unlabeled data is almost as old as EM itself. Titterington et al. (Titterington et al.,
1985, section 5.7) review early theoretical work on the problem of discriminant
analysis in the presence of additional unlabeled data. The most common assumption
is that the data have been generated from a mixture of two Gaussians with equal
covariance matrices, in which case the Bayes discriminant is linear. They analyze
the “plug-in” method from the generative paradigm (see section 2.2.1) in which the
parameters of the class distributions are estimated by maximum likelihood. If the
two Gaussians are somewhat well separated, the asymptotic gain of using unlabeled
samples is very significant. For details, see (O’Neill, 1978; Ganesalingam and
McLachlan, 1978, 1979). McLachlan (McLachlan, 1975) gives a practical algorithm
for this case which is essentially a “hard” version of EM, i.e. in every E step the
unlabeled points are allocated to one of the populations, using the discriminant
derived from the mixture parameters of the previous step (note that the general
EM algorithm had not been proposed at that time). He proves that for “moderate-
sized” training sets from each population and for a pool Du of points sampled
from the mixture, if the algorithm is initialized with the maximum-likelihood (ML)
solution based on the labeled data, the solutions computed by the method converge
almost surely to the true mixture distribution as |Du | = m → ∞. These
early papers provide some important insight into properties of the semi-supervised
problem, but their strict assumptions limit the conclusions that can be drawn for
large real-world problems.
The EM algorithm has been applied to text classification by Nigam et al. (see
(Nigam et al., 2000), or chapter 3 in this book). From Eq. 2.1 we see that in the
joint log likelihood, labeled and unlabeled data are weighted at the ratio n to m.
This “natural” weighting makes sense if the likelihood is taken at face value, i.e. as
a correct description of the sampling mechanism for the data, but it is somewhat
irrelevant to the problem of SSL where a strong sampling bias is present whose
exact size is usually unknown. In other words, unlabeled data are often available
in huge quantities simply because they can be obtained much more cheaply than labeled
data. If we use the natural weighting in the interesting case m ≫ n, the labeled data
Dl are effectively ignored. Nigam et al. suggest reweighting the terms in Eq. 2.1 by
(1 − λ)/n and λ/m respectively (the natural weighting is given by λ = m/(m + n))
and adjusting λ by standard techniques such as cross-validation on Dl .
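The effect of the reweighting is easy to see with concrete numbers. The following small sketch (the helper name `term_weights` and the sample sizes are hypothetical) computes the per-point weights (1 − λ)/n and λ/m, and shows that the natural choice λ = m/(m + n) gives every point the same weight 1/(n + m), so that the labeled sample as a whole carries only a share n/(n + m) of the criterion.

```python
def term_weights(n, m, lam=None):
    """Return (weight per labeled point, weight per unlabeled point, lambda).

    If lam is None, the natural weighting lam = m / (m + n) is used.
    """
    if lam is None:
        lam = m / (m + n)
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) / n, lam / m, lam

n, m = 10, 10_000  # assumed sample sizes with m >> n

w_l, w_u, lam = term_weights(n, m)
print("natural lambda:", lam)
print("per-point weights:", w_l, w_u)
print("total share of labeled data:", 1.0 - lam)

# A smaller lambda gives the labeled sample more influence.
w_l_half, w_u_half, _ = term_weights(n, m, lam=0.5)
print("lambda = 0.5 ->", w_l_half, w_u_half)
```

In practice λ would be selected from a grid of candidate values, for example by cross-validation on Dl as suggested in the text; that search is not shown here.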
Note that y is treated as the latent class variable as far as the estimation of
P (x) from Du is concerned, and we can just as well allow for more mixture
components than classes. Namely, we can introduce an additional separator variable k
such that under the model x and y are independent given k. This means that all
the information x contains about its class y is already captured in k. This fact is
illustrated in the accompanying independence model.
[Figure: independence model over the variables k, x, and y.]
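The reweighted joint log-likelihood is
\[
\frac{1-\lambda}{n} \sum_{i=1}^{n} \log \sum_{k} \beta_{y_i,k}\, \pi_k\, P(x_i \mid k, \theta)
\;+\; \frac{\lambda}{m} \sum_{i=n+1}^{n+m} \log \sum_{k} \pi_k\, P(x_i \mid k, \theta),
\]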
9. These are not very restrictive; for example, they hold for all (regular) exponential
families.
follow is independent of the label information. They show how to employ homotopy
continuation (path-following) methods in order to trace the solution path up to λ∗
fairly efficiently. By restricting themselves to λ ≤ λ∗ they circumvent the many
(local) maxima problem, and their choice of λ = λ∗ is well motivated.
Murray and Titterington (1978) (see also (Titterington et al., 1985), ex. 4.3.11)
suggest using Dl for each class to obtain kernel-based estimates of the densities
P (x|y). They fix these estimates and use EM in order to maximize the joint
likelihood of Dl , Du w.r.t. the mixing coefficients πt only.10 This procedure is
robust, but does not make a lot of use of the unlabeled data. If Dl is small, the
kernel-based estimates of the P (x|y) will be poor, and even if Du can be used to
obtain better values for the mixing coefficients, this is not likely to rescue the final
discrimination. Furthermore, the procedure has been suggested for situations where
the natural weighting between Dl , Du is appropriate, which is typically not the case
for SSL.
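A procedure of this kind can be sketched as follows. This is a hedged illustration rather than the authors' algorithm: it assumes Gaussian kernel density estimates with a hand-picked bandwidth `h`, the natural weighting of labeled and unlabeled points, and synthetic one-dimensional data; all names are hypothetical.

```python
import numpy as np

def kde(x, centers, h):
    """Gaussian kernel density estimate at points x, built from `centers`."""
    z = (x[:, None] - centers[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(centers) * h * np.sqrt(2.0 * np.pi))

def em_mixing_only(dens_u, labeled_counts, n_iter=200):
    """EM over the mixing coefficients only; class densities stay fixed.

    dens_u[i, c]      : fixed estimate of P(x_i | y = c) at unlabeled point x_i
    labeled_counts[c] : number of labeled points in class c
    """
    n_classes = dens_u.shape[1]
    pi = np.full(n_classes, 1.0 / n_classes)
    total = labeled_counts.sum() + dens_u.shape[0]
    for _ in range(n_iter):
        # E step: posterior class probabilities for the unlabeled points.
        resp = pi * dens_u
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: labeled counts plus expected unlabeled counts, normalised.
        pi = (labeled_counts + resp.sum(axis=0)) / total
    return pi

# Assumed toy data: three labeled points per class, unbalanced unlabeled pool.
rng = np.random.default_rng(1)
x_l0 = np.array([-2.5, -2.0, -1.5])
x_l1 = np.array([1.5, 2.0, 2.5])
x_u = np.concatenate([rng.normal(-2.0, 0.5, 30), rng.normal(2.0, 0.5, 70)])

h = 1.0  # bandwidth, chosen by hand for this sketch
dens_u = np.column_stack([kde(x_u, x_l0, h), kde(x_u, x_l1, h)])
pi = em_mixing_only(dens_u, labeled_counts=np.array([3.0, 3.0]))
print("estimated mixing coefficients:", pi)
```

Only the class proportions are updated from Du here; the density estimates never change, which is why poor kernel estimates from a small Dl cannot be repaired by this procedure.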
Shahshahani and Landgrebe (Shahshahani and Landgrebe, 1994) provide an
analysis aimed toward the general question whether unlabeled data can help in
classification, based on methods originating in asymptotic maximum-likelihood
theory. Their argumentation is somewhat unclear and has been criticized by various
other authors (e.g., (Nigam et al., 2000; Zhang and Oles, 2000)). They do not define
model classes and seem to confuse asymptotic and finite-sample terms. After all,
their claim seems to be that unlabeled data can reduce the asymptotic variance of
an estimator, but they do not worry about the fact that such modifications could
actually introduce new bias, especially in the interesting case where m ≫ n. On the
practical side, the algorithm they suggest is the joint EM scheme discussed above.
Another analysis of SSL, which also employs Fisher information, is given by Zhang
and Oles (Zhang and Oles, 2000). The authors show that for purely diagnostic
models, unlabeled data cannot help (this fact has of course been known for a long
time; see also section 2.2.2). In the generative setup, they show that Du can only
help. While this is true under their assumptions, it draws on asymptotic concepts
and may not be relevant in practical situations. The Fisher information charac-
terizes the minimal asymptotic variance of an unbiased estimator only, and the
maximum-likelihood estimator is typically only asymptotically unbiased. Applying
such concepts to the case where Dl is small cannot lead to strong conclusions, and
the question of (even asymptotic) bias remains in the case where m grows much
faster than n. On the practical side, some empirical evidence is presented on a
text categorization task which shows that unlabeled data can lead to instabilities
in common transduction algorithms and therefore “hurt” (see comments in section
2.2.3).
10. EM w.r.t. the mixing coefficients only always converges to a unique global optimum.
It is essentially a variant of the Blahut-Arimoto algorithm to compute the rate distortion
function which is important for quantization (see (Cover and Thomas, 1991)).
We noted in section 2.2.2 that unlabeled data cannot be used in Bayesian diagnostic
methods if θ and µ are a priori independent, so in order to make use of Du we
have to employ conditional priors P (θ|µ). Unlabeled data may still be useful in
non-Bayesian settings. An example has been given by Tong and Koller (Tong and
Koller, 2000) under the name of restricted Bayes optimal classification (RBOC).
Consider a diagnostic method in which the sum of an empirical loss term and a
regularization functional is minimized. The empirical loss term is the expectation
w.r.t. the labeled sample Dl of a loss function relevant to the problem (e.g., the
zero-one loss L(x, y, h) = I{y≠h(x)} ). The authors suggest incorporating unlabeled data
Du by estimating P (x, y) from Dl ∪Du , then replacing the empirical loss term by the
expectation of the loss under this estimate. The regularization term is not changed.
We can compare this method directly with input-dependent regularization (see
section 2.2.3). In the former, the empirical loss part (the negative log likelihood for
a probabilistic model) is modified based on Du ; in the latter it is the regularization
term. We would not expect RBOC to produce very different results from the
corresponding diagnostic technique, especially if n is rather small (which is the
interesting case in practice). This is somewhat confirmed by the weak results in
(Tong and Koller, 2000). A very similar idea is proposed in (Chapelle et al., 2001)
in order to modify the diagnostic SVM framework.
Anderson (Anderson, 1979) suggested an interesting modification of logistic
regression in which unlabeled data can be used. In binary logistic regression, the log
odds are modeled as a linear function, which gives P (x|1) = exp(β^T x)P (x|2) and
P (x) = (π1 exp(β^T x)+1−π1 )P (x|2), where π1 = P {t = 1}. Anderson now chooses
the parameters β, π1 and P (x|2) in order to maximize the likelihood of both Dl
and Du , subject to the constraints that P (x|1) and P (x|2) are normalized. For
finite X, this problem can be transformed into an unconstrained optimization w.r.t.
the parameters β, π1 . For a continuous input variable x, Anderson advocates using
the form of P (x|2) derived for the “finite X” case, although this is not a smooth
function. Unfortunately, it is not clear how to generalize this idea to more realistic
models, for example how to “kernelize” it, and the form of P (x|2) is inadequate for
many problems with infinite X.
It is not hard to construct “malicious” examples of P (x, y) which defy any given
dependence assumption on θ, µ. However, in practice it is often the case that
cluster structure in the data for x indeed is mostly consistent with the labeling.
It is not very fruitful to speculate about why this is the case, although certainly
there is a selection bias toward features (i.e. components in x) which are relevant
w.r.t. the labeling process, which means they should group in the same way (w.r.t.
a simple distance) as labelings. The cluster assumption (CA) (e.g., (Seeger, 2000b))
provides a general way of exploiting this observation for SSL. It postulates that two
points x′ , x′′ should have the same label y with high probability if there is a “path”
between them in X which moves through regions of significant density P (x) only. In
other words, a discrimination function between the classes should be smooth within
connected high-density regions of P (x). Thus, the CA can be compared directly
with global smoothness assumptions requiring the discriminant to change smoothly
everywhere, independent of P (x). While the latter penalize sharp changes also in
regions which will be sparsely populated by training and test data, the CA remains
indifferent there.
The CA is implemented (to different extents) in a host of methods proposed
for SSL. Most prominent are probably label propagation methods (Szummer and
Jaakkola, 2002b; Belkin and Niyogi, 2003b; Zhu et al., 2003b). The rough idea
is to construct a graph with vertices from X l ∪ X u which contains the test
set to be labeled and all of X l . Nearest neighbors are joined by edges with a
weight proportional to local correlation strength. We then initialize the nodes
corresponding to X l with the labels Yl and propagate label distributions over the
remaining nodes in the manner of a Markov chain on the graph (Szummer and
Jaakkola, 2002b). It is also possible to view the setup as a Gaussian field with
the graph and edge weights specifying the inverse covariance matrix (Zhu et al.,
2003b). Label propagation techniques implement the CA relative to unsupervised
spectral clustering (Belkin and Niyogi, 2003b). The CA has been implemented for
kernel machines by way of the cluster kernel (Chapelle et al., 2003). Furthermore,
the generative SSL techniques of section 2.3.1 can be seen as implementing the CA
relative to a mixture model clustering.
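The flavor of label propagation can be conveyed with a small sketch. It follows the harmonic-function formulation associated with the Gaussian field view, and makes several simplifying assumptions that are not taken from the text: a dense graph with Gaussian edge weights (instead of a nearest-neighbor graph), binary labels, and a hand-picked width `sigma`. The function name and the toy points are hypothetical.

```python
import numpy as np

def harmonic_label_propagation(X, y_l, sigma=1.0):
    """Propagate binary labels over a dense Gaussian-weighted graph.

    The first len(y_l) rows of X are labeled points with labels y_l in {0, 1};
    the remaining rows are unlabeled. Returns soft labels in [0, 1] for the
    unlabeled rows by solving (D_uu - W_uu) f_u = W_ul f_l.
    """
    n_l = len(y_l)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)               # no self-loops
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[n_l:, n_l:]
    W_ul = W[n_l:, :n_l]
    return np.linalg.solve(L_uu, W_ul @ np.asarray(y_l, dtype=float))

# Two well-separated clusters, one labeled point in each (assumed toy data).
X = np.array([
    [0.0, 0.0],    # labeled, class 0
    [5.0, 5.0],    # labeled, class 1
    [0.5, 0.2],    # unlabeled, near the first cluster
    [-0.3, 0.4],   # unlabeled, near the first cluster
    [4.6, 5.1],    # unlabeled, near the second cluster
    [5.3, 4.8],    # unlabeled, near the second cluster
])
y_l = np.array([0, 1])
f_u = harmonic_label_propagation(X, y_l)
print("soft labels for unlabeled points:", np.round(f_u, 3))
```

Each unlabeled point ends up with the weighted average of its neighbors' values, so labels spread freely inside a dense cluster and hardly at all across the low-density gap, which is the behavior the CA asks for.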
A generalization of the CA has been given by Corduneanu and Jaakkola (see
chapter 10 in this book) who show how to obtain a regularizer for the conditional
distribution P (y|x) from information-theoretic arguments.
The Fisher kernel was proposed in (Jaakkola and Haussler, 1999) in order to
exploit additional unlabeled data within a kernel-based support vector machine
(SVM) framework for detecting remote protein homologies. The idea is to fit a
generative model P (x|µ) to Du by maximum likelihood (resulting in µ̂, say). If
x are DNA sequences, a hidden Markov model (HMM) can be employed. P (x|µ̂)
represents the knowledge extracted from Du , and the Fisher kernel is a general way
of constructing a covariance kernel Kµ̂ which depends on this knowledge. We can
then fit an SVM or a Gaussian process (GP) classifier to Dl using the kernel Kµ̂ .
Identifying this setup as an instance of input-dependent regularization is easiest in
the GP context. Here, θ is a process representing the discriminant function (we
assume c = 2 for simplicity), and P (θ|µ) is a GP distribution with zero-mean
function and covariance kernel Kµ . In the ML context, P (µ|Du ) is approximated
by the delta distribution δµ̂ .
Define the Fisher score to be Fµ̂ (x) = ∇µ log P (x|µ), where the gradient w.r.t. µ is
evaluated at µ̂. The Fisher information matrix is F = E_{P (·|µ̂)} [Fµ̂ (x) Fµ̂ (x)^T ].
The naive Fisher kernel is Kµ̂ (x, x′ ) = Fµ̂ (x)^T F^{−1} Fµ̂ (x′ ). In a variant, F is
replaced by αI for a scale parameter α. Other variants of the Fisher kernel are
obtained by using the Fisher score Fµ̂ (x) as feature vector for x and plugging
these into a standard kernel such as the Gaussian radial basis function (RBF) one.
The latter “embeddings” seem to be more useful in practice. The Fisher kernel can
be motivated from various angles (see (Jaakkola and Haussler, 1999)), for example,
as first-order approximation to a sample mutual information between x, x′ (Seeger,
2002).
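To make the definitions concrete, the following sketch computes the naive Fisher kernel for a deliberately simple stand-in model: a univariate Gaussian with parameters µ = (mean, variance), for which the Fisher score and the Fisher information matrix are available in closed form. The choice of model, the function names, and the data are assumptions made for illustration; the application described in the text uses sequence models such as HMMs instead.

```python
import numpy as np

def fit_gaussian_ml(x_u):
    """Maximum-likelihood estimates (mean, variance) from unlabeled data."""
    return float(np.mean(x_u)), float(np.var(x_u))  # np.var uses ddof=0 (ML)

def fisher_score(x, mean, var):
    """Gradient of log N(x; mean, var) w.r.t. (mean, var), one row per point."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d_mean = (x - mean) / var
    d_var = -0.5 / var + 0.5 * (x - mean) ** 2 / var ** 2
    return np.stack([d_mean, d_var], axis=1)

def fisher_information(mean, var):
    """Closed-form Fisher information of N(mean, var) in (mean, var) coordinates."""
    return np.diag([1.0 / var, 0.5 / var ** 2])

def naive_fisher_kernel(xa, xb, mean, var):
    """K(x, x') = F(x)^T F^{-1} F(x') for all pairs of points in xa and xb."""
    Fa = fisher_score(xa, mean, var)
    Fb = fisher_score(xb, mean, var)
    return Fa @ np.linalg.inv(fisher_information(mean, var)) @ Fb.T

# Fit the generative model to (assumed) unlabeled data, then build the kernel
# matrix on a handful of labeled inputs.
rng = np.random.default_rng(2)
x_u = rng.normal(0.0, 1.0, 500)
mean_hat, var_hat = fit_gaussian_ml(x_u)
x_labeled = np.array([-1.0, 0.0, 1.5])
K = naive_fisher_kernel(x_labeled, x_labeled, mean_hat, var_hat)
print(np.round(K, 3))
```

The resulting matrix `K` could then be passed to any kernel classifier that accepts a precomputed kernel; replacing `fisher_information` by a scaled identity gives the variant with αI mentioned above.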
2.3.3.3 Co-Training
Co-training was introduced by Blum and Mitchell (Blum and Mitchell, 1998) and is
related to earlier work on unsupervised learning (Becker and Hinton, 1992). The idea
is to make use of different “views” on the objects to be classified (here we restrict
ourselves to binary classification, c = 2, and to two views). For example, a webpage
can be represented by the text on the page, but also by the text of hyperlinks
referring to the page. We can train classifiers separately which are specialized to
each of the views, but in this context unlabeled data Du can be helpful in that,
although the true label is missing, it must be the same for all the views. It turns out
that co-training can be seen as a special case of Bayesian inference using conditional
priors (see section 2.2.3), as is demonstrated below in this section.
Let X = X(1) × X(2) be a finite or countable input space. If x = (x(1) , x(2) ), the
x(j) are different “views” on x. We are also given spaces Θ(j) of concepts (binary
classifiers) θ (j) . Elements θ = (θ (1) , θ (2) ) ∈ Θ = Θ(1) × Θ(2) are called concepts
over X, although we may have θ (1) (x(1) ) ≠ θ (2) (x(2) ) for some x = (x(1) , x(2) ) ∈ X.
Whenever the θ (j) agree, we write θ(x) = θ (1) (x(1) ). If A ⊂ X, we say that
a concept θ = (θ (1) , θ(2) ) is compatible with A if θ (1) (x(1) ) = θ(2) (x(2) ) for all
x = (x(1) , x(2) ) ∈ A. Denote by Θ(A) the space of all concepts compatible with
A.11 If Q(x) is a distribution over X with support S = supp Q(x) = {x|Q(x) > 0},
we say that a concept θ is compatible with the distribution Q if it is compatible
with S.
11. In order not to run into trivial problems, we assume that Θ(A) is never empty, which
can be achieved by adding the constant concept 1 to both Θ(j) .
In the co-training setting, there is an unknown input distribution P (x). A
target concept θ is sampled from some unknown distribution over Θ, and the
data distribution is P (y|x) = I{θ(x)=y} if θ ∈ Θ({x}), 1/2 otherwise.12 However,
the central assumption is that the target concept θ is compatible with the input
distribution P (x). More specifically, the support of the concept distribution must be
contained in Θ(supp P (x)). Therefore, unlabeled data Du can be used by observing
that Θ(supp P (x)) ⊂ Θ(Du ∪X l ), so the effective concept space can be shrunk from
Θ to Θ(Du ∪ X l ).
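The shrinking of the concept space can be made tangible by brute-force enumeration on a tiny finite input space. The sketch below is illustrative only: it assumes each view takes three values, takes every binary function on a view as a possible concept, and uses invented unlabeled and labeled points; none of these specifics come from the text.

```python
from itertools import product

def all_concepts(n_values):
    """All binary functions on {0, ..., n_values - 1}, stored as tuples."""
    return list(product([0, 1], repeat=n_values))

def compatible(theta, points):
    """True if both views agree on every point (x1, x2) in `points`."""
    t1, t2 = theta
    return all(t1[x1] == t2[x2] for x1, x2 in points)

def consistent(theta, labeled):
    """True if both views predict the observed label on every labeled point."""
    t1, t2 = theta
    return all(t1[x1] == y and t2[x2] == y for (x1, x2), y in labeled)

# Concept space Theta = Theta(1) x Theta(2): 8 * 8 = 64 pairs of view classifiers.
concepts = list(product(all_concepts(3), all_concepts(3)))

unlabeled = [(0, 0), (1, 1), (0, 1), (2, 2)]   # hypothetical Du
labeled = [((0, 0), 1)]                        # hypothetical Dl
x_labeled = [x for x, _ in labeled]

# Effective concept space Theta(Du ∪ Xl), then the concepts that also fit Dl.
shrunk = [th for th in concepts if compatible(th, unlabeled + x_labeled)]
with_du = [th for th in shrunk if consistent(th, labeled)]
without_du = [th for th in concepts if consistent(th, labeled)]

print(len(concepts), len(shrunk), len(with_du), len(without_du))
```

In this toy example the unlabeled points link the view values into two groups that must each receive a single label, so a single labeled point leaves far fewer candidate concepts with Du than without it.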
We demonstrate that co-training can be understood as Bayesian inference with
conditional priors encoding the compatibility assumption. We model P (x) by
{P (x|µ)} and introduce the variable S = supp P (x|µ) for convenience, then define
P (θ|µ) = P (θ|S) as
P (θ|S) = fS (θ) I{θ∈Θ(S)} ,
where fS (θ) > 0, and all P (θ|S) are properly normalized. For example, if Θ(S)
is finite, we can choose fS (θ) = |Θ(S)|^{−1} . The likelihood is given by P (y|x, θ) =
(1/2)(I{θ(1) (x(1) )=y} + I{θ(2) (x(2) )=y} ) (noiseless case). Since P (θ|S) = 0 for
θ ∉ Θ(S), the conditional prior encodes the compatibility assumption. The posterior
belief about θ is given by
P (θ|Dl , Du ) ∝ I{θ(xi )=yi , i=1,...,n} ∫ P (θ|S)P (S|X l , Du ) dS,
12. Here, IE is 1 if E is true, 0 otherwise. The scenario is called noiseless because the only
source of randomness is the uncertainty in the target function.
2.4 Conclusions
For several decades, statisticians have advocated using a combination of labeled and
unlabeled data to train classifiers by estimating parameters of a generative model
through iterative expectation-maximization (EM) techniques. This chapter explores
the effectiveness of this approach when applied to the domain of text classification.
Text documents are represented here with a bag-of-words model, which leads to
a generative classification model based on a mixture of multinomials. This model
is an extremely simplistic representation of the complexities of written text. This
chapter explains and illustrates three key points about semi-supervised learning
for text classification with generative models. First, despite the simplistic repre-
sentation, some text domains have a high positive correlation between generative
model probability and classification accuracy. In these domains, a straightforward
application of EM with the naive Bayes text model works well. Second, some text
domains do not have this correlation. Here we can adopt a more expressive and ap-
propriate generative model that does have a positive correlation. In these domains,
semi-supervised learning again improves classification accuracy. Finally, EM suffers
from the problem of local maxima, especially in high-dimension domains such as
text classification. We demonstrate that deterministic annealing, a variant of EM,
can help overcome the problem of local maxima and increase classification accuracy
further when the generative model is appropriate.
3.1 Introduction
The idea of learning classifiers from a combination of labeled and unlabeled data
is an old one in the statistics community. At least as early as 1968, it was
suggested that labeled and unlabeled data could be combined to build classifiers
with likelihood maximization by testing all possible class assignments (Hartley
and Rao, 1968). The seminal paper by Day (1969) presents an iterative EM-like
approach for estimating the parameters of a mixture of two normals with known covariances
from unlabeled data alone. Similar iterative algorithms for building maximum-
likelihood classifiers from labeled and unlabeled data with an explicit generative
model followed, primarily for mixtures of normal distributions (McLachlan, 1975;
Titterington, 1976).
Dempster et al. (1977) presented the theory of the EM framework, bringing to-
gether and formalizing many of the commonalities of previously suggested iterative
techniques for likelihood maximization with missing data. Its applicability to es-
timating maximum likelihood (or maximum a posteriori) parameters for mixture
models from labeled and unlabeled data (Murray and Titterington, 1978) and then
using this for classification (Little, 1977) was recognized immediately. Since then,
this approach has continued to be used and studied (McLachlan and Ganesalingam,
1982; Ganesalingam, 1989; Shahshahani and Landgrebe, 1994). Using likelihood
maximization of mixture models for combining labeled and unlabeled data for clas-
sification has more recently made its way to the machine learning community (Miller
and Uyar, 1996; Nigam et al., 1998; Baluja, 1999).
The theoretical basis for expectation-maximization shows that with sufficiently
large amounts of unlabeled data generated by the model class in question, a more
probable model can be found than if using just the labeled data alone. If the
classification task is to predict the latent variable of the generative model, then
with sufficient data a more probable model will also result in a more accurate
classifier.
This approach rests on the assumption that the generative model is correct.
When the classification task is one of classifying human-authored texts (as we
consider here) the true generative model is impossible to parameterize, and instead
practitioners tend to use very simple representations. For example, the commonly
used naive Bayes classifier represents each authored document as a bag of words,
discarding all word-ordering information. The generative model for this classifier
asserts that documents are created by a draw from a class-conditional multinomial.
As this is an extreme simplification of the authoring process, it is interesting to
ask whether such a generative modeling approach to semi-supervised learning is
appropriate or beneficial in the domain of text classification.
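The bag-of-words simplification can be illustrated with a few lines of code. The sketch below assumes a hypothetical four-word vocabulary, two classes, and made-up word probabilities; it shows that under the class-conditional multinomial model only word counts matter, so any reordering of a document yields the same class posterior.

```python
import numpy as np

def bag_of_words(tokens, vocab):
    """Word-count vector; word order (and out-of-vocabulary words) are dropped."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for token in tokens:
        if token in index:
            counts[index[token]] += 1
    return counts

def class_posterior(counts, prior, word_probs):
    """P(class | document) under class-conditional multinomials.

    word_probs[c, w] = P(word w | class c). The multinomial coefficient is the
    same for every class and cancels, so it is omitted.
    """
    scores = np.log(prior) + np.log(word_probs) @ counts
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical vocabulary and parameters (class 0: sports, class 1: finance).
vocab = ["goal", "match", "stock", "market"]
prior = np.array([0.5, 0.5])
word_probs = np.array([
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

doc = "stock market match market".split()
post = class_posterior(bag_of_words(doc, vocab), prior, word_probs)
post_reordered = class_posterior(bag_of_words(doc[::-1], vocab), prior, word_probs)
print("posterior:", np.round(post, 4))
print("same after reordering:", np.allclose(post, post_reordered))
```

In a semi-supervised setting, posteriors of this kind would play the role of the E step for unlabeled documents, with the word probabilities and class priors re-estimated in the M step.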
This chapter demonstrates that generative approaches are appropriate for semi-
supervised text classification when the selected generative model probabilities are
well correlated with classification accuracy, and when suboptimal local maxima
can be mostly avoided. In some cases, the naive Bayes generative model, despite its
simplicity, is sufficient. We find that model probability is strongly correlated with
classification accuracy, and expectation-maximization techniques yield classifiers
with unlabeled data that are significantly more accurate than those built with
labeled data alone. In other cases, the naive Bayes generative model is not well
correlated with classification accuracy. By adopting a more expressive generative
model, accuracy and model probability correlations are restored, and again EM
yields good results.
One of the pitfalls of EM is that it only guarantees the discovery of local maxima
and not global maxima in model probability space. In domains like text classification,
with a very large number of parameters, this effect can be very significant.
We show that when model probability and classification accuracy are well correlated,
the use of deterministic annealing, an alternative parameter estimation process, finds
more probable and thus more accurate classifiers.
Nongenerative approaches have also been used for semi-supervised text classification.
Joachims (1999) uses transductive support vector machines to build discriminative
classifiers for several text classification tasks. Blum and Mitchell (1998) use
the co-training setting to build naive Bayes classifiers for webpages, using anchor
text and the page itself as two different sources of information about an instance.
Zelikovitz and Hirsh (2000) use unlabeled data as background knowledge to augment
a nearest-neighbor classifier. Instead of matching a test example directly to
its closest labeled example, they match a test example to a labeled example
by measuring their similarity to a common set of unlabeled examples.
This chapter proceeds as follows. Section 3.2 presents the generative model used
for text classification and shows how to perform semi-supervised learning with EM.
Section 3.3 shows an example where this approach works well. Section 3.4 presents
a more expressive generative model that works when the naive Bayes assumption
is not sufficient, and experimental results from a domain that needs it. Section 3.5
presents deterministic annealing and shows that this finds model parameterizations
that are much more probable than those found by EM, especially when labeled data
are sparse.
This section presents a framework for characterizing text documents and shows how
to use this to train a classifier from labeled and unlabeled data. The framework
defines a probabilistic generative model, and embodies three assumptions about
the generative process: (1) the data are produced by a mixture model, (2) there
is a one-to-one correspondence between mixture components and classes, and (3)
the mixture components are multinomial distributions of individual words. These
are the assumptions used by the naive Bayes classifier, a commonly used tool
for standard supervised text categorization (Lewis, 1998; McCallum and Nigam,
1998a).
We assume documents are generated by a mixture of multinomials model, where
each mixture component corresponds to a class. Let there be M classes and a
vocabulary of size |X|; each document xi has |xi | words in it. How do we create a
document using this model? First, we roll a biased M -sided die to determine the
class of our document. Then, we pick up the biased |X|-sided die that corresponds
to the chosen class. We roll this die |xi | times, and count how many times each
$$P(x_i \mid \theta) = \sum_{j \in [M]} P(c_j \mid \theta)\, P(x_i \mid c_j; \theta). \tag{3.1}$$
This formulation embodies the standard naive Bayes assumption: that the words
of a document are conditionally independent of the other words in the same
document, given the class label.
Thus the parameters of an individual mixture component define a multinomial
distribution over words, i.e., the collection of word probabilities, each written
$\theta_{w_t|c_j}$, such that $\theta_{w_t|c_j} \equiv P(w_t \mid c_j; \theta)$, where $t \in [|X|]$ and $\sum_t P(w_t \mid c_j; \theta) = 1$.
Since we assume that for all classes, document length is identically distributed, it
does not need to be parameterized for classification. The only other parameters
of the model are the mixture weights (class probabilities), $\theta_{c_j} \equiv P(c_j \mid \theta)$, which
indicate the probabilities of selecting the different mixture components. Thus the
complete collection of model parameters, $\theta$, defines a set of multinomials and class
probabilities: $\theta = \{\theta_{w_t|c_j} : w_t \in X, c_j \in [M];\ \theta_{c_j} : c_j \in [M]\}$.
To summarize, the full generative model, given by combining Eqs. 3.1 and 3.2,
assigns probability P (xi |θ) to generating document xi as follows:
$$P(x_i \mid \theta) \propto P(|x_i|) \sum_{j \in [M]} P(c_j \mid \theta) \prod_{w_t \in X} P(w_t \mid c_j; \theta)^{x_{it}} \tag{3.3}$$
where the set of word counts xit is a sufficient statistic for the parameter vector θ
in this generative model.
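The die-rolling description of the generative process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_document`, the two-class, three-word parameters, and the fixed document length are all made up for the example and do not come from the chapter.

```python
import random

def sample_document(class_probs, word_probs, length, rng):
    """Draw (class index, word-count vector) from a mixture of multinomials."""
    num_classes = len(class_probs)
    vocab_size = len(word_probs[0])
    # Roll the biased M-sided die to choose the class (mixture component).
    j = rng.choices(range(num_classes), weights=class_probs, k=1)[0]
    # Roll that class's biased |X|-sided die `length` times and count outcomes.
    counts = [0] * vocab_size
    for t in rng.choices(range(vocab_size), weights=word_probs[j], k=length):
        counts[t] += 1
    return j, counts

# Hypothetical parameters: 2 classes, vocabulary of 3 words.
rng = random.Random(0)
class_probs = [0.7, 0.3]
word_probs = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
label, counts = sample_document(class_probs, word_probs, length=20, rng=rng)
```

The returned count vector plays the role of the word counts x_it in Eq. 3.3; word order is never generated, which mirrors the bag-of-words simplification.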
Learning a naive Bayes text classifier from a set of labeled documents consists of
estimating the parameters of the generative model. The estimate of the parameters
θ is written θ̂. Naive Bayes uses the maximum a posteriori (MAP) estimate, thus
finding arg maxθ P(θ|X, Y ). This is the value of θ that is most probable given the
evidence of the training data and a prior.
Our prior distribution is formed with the product of Dirichlet distributions—one
for each class multinomial and one for the overall class probabilities. The Dirichlet
is the commonly used conjugate prior distribution for multinomial distributions.
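The form of the Dirichlet is
$$P(\theta_{w_t|c_j} \mid \alpha) \propto \prod_{w_t \in X} P(w_t \mid c_j)^{\alpha_t - 1}, \tag{3.4}$$
where the αt are constants greater than zero. We set all αt = 2, which corresponds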
to a prior that favors the uniform distribution. This is identical to Laplace and
m-estimate smoothing. A well-presented introduction to Dirichlet distributions is
given by Stolcke and Omohundro (1994).
The parameter estimation formulas that result from maximization with the data
and our prior are the familiar smoothed ratios of empirical counts. The word
probability estimates θ̂wt |cj are
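$$\hat\theta_{w_t|c_j} \equiv P(w_t \mid c_j; \hat\theta) = \frac{1 + \sum_{x_i \in X} \delta_{ij}\, x_{it}}{|X| + \sum_{s=1}^{|X|} \sum_{x_i \in X} \delta_{ij}\, x_{is}}, \tag{3.5}$$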
where δij is given by the class label: 1 when yi = cj and 0 otherwise.
The class probabilities, θ̂cj , are estimated in the same manner, and also involve
a ratio of counts with smoothing:
$$\hat\theta_{c_j} \equiv P(c_j \mid \hat\theta) = \frac{1 + \sum_{i=1}^{|X|} \delta_{ij}}{M + |X|}. \tag{3.6}$$
The derivation of these ratios-of-counts formulas comes directly from maximum
a posteriori parameter estimation. Finding the θ that maximizes P(θ|X, Y ) is
accomplished by first breaking this expression into two terms by the Bayes rule:
P(θ|X, Y ) ∝ P(X, Y |θ)P(θ). The first term is calculated by the product of all
the document likelihoods (from Eq. 3.1). The second term, the prior distribution
over parameters, is the product of Dirichlets. The whole expression is maximized
by solving the system of partial derivatives of log(P(θ|X, Y )), using Lagrange
multipliers to enforce the constraint that the word probabilities in a class must
sum to one. This maximization yields the ratio of counts seen above.
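As a minimal sketch of the smoothed ratio-of-counts estimates in Eqs. 3.5 and 3.6, the following Python computes them for a toy labeled set. The function name `train_naive_bayes` and the three tiny documents are assumptions made for illustration; documents are represented as word-count vectors.

```python
def train_naive_bayes(docs, labels, num_classes):
    """docs: list of word-count vectors x_i; labels: class index y_i per document."""
    vocab_size = len(docs[0])
    # Eq. 3.6: one pseudo-count per class, normalized by M + number of documents.
    class_counts = [0] * num_classes
    for y in labels:
        class_counts[y] += 1
    class_probs = [(1 + n) / (num_classes + len(docs)) for n in class_counts]
    # Eq. 3.5: one pseudo-count per word, normalized by vocabulary size + class word total.
    word_counts = [[0] * vocab_size for _ in range(num_classes)]
    for x, y in zip(docs, labels):
        for t, n in enumerate(x):
            word_counts[y][t] += n
    word_probs = [[(1 + n) / (vocab_size + sum(row)) for n in row]
                  for row in word_counts]
    return class_probs, word_probs

# Toy data (made up): three documents over a three-word vocabulary.
docs = [[2, 1, 0], [1, 2, 0], [0, 0, 3]]
labels = [0, 0, 1]
class_probs, word_probs = train_naive_bayes(docs, labels, num_classes=2)
```

Because of the added pseudo-counts, a word never seen in a class still receives nonzero probability, which keeps later log computations finite.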
Given estimates of these parameters calculated from labeled training documents,
it is possible to turn the generative model backward and calculate the probability
that a particular mixture component generated a given document to perform
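classification. This follows from an application of the Bayes rule:
$$P(y_i = c_j \mid x_i; \hat\theta) = \frac{P(c_j \mid \hat\theta)\, P(x_i \mid c_j; \hat\theta)}{P(x_i \mid \hat\theta)} = \frac{P(c_j \mid \hat\theta) \prod_{w_t \in X} P(w_t \mid c_j; \hat\theta)^{x_{it}}}{\sum_{k \in [M]} P(c_k \mid \hat\theta) \prod_{w_t \in X} P(w_t \mid c_k; \hat\theta)^{x_{it}}}. \tag{3.7}$$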
If the task is to classify a test document xi into a single class, then the class with
the highest posterior probability, arg maxj P(yi = cj |xi ; θ̂), is selected.
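Turning the model around with the Bayes rule can be sketched as follows. The computation is done in log space to avoid underflow from multiplying many small word probabilities; that numerical device, the function name `class_posteriors`, and the parameter values are assumptions of this sketch rather than details given in the text.

```python
import math

def class_posteriors(x, class_probs, word_probs):
    """P(y = c_j | x) for a word-count vector x under a naive Bayes model."""
    # log P(c_j) + sum_t x_t * log P(w_t | c_j) for each class j.
    log_joint = [math.log(pc) + sum(n * math.log(pw) for n, pw in zip(x, pws))
                 for pc, pws in zip(class_probs, word_probs)]
    # Subtract the maximum before exponentiating, then normalize.
    top = max(log_joint)
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical parameters for 2 classes and a 3-word vocabulary.
class_probs = [0.6, 0.4]
word_probs = [[4 / 9, 4 / 9, 1 / 9], [1 / 6, 1 / 6, 4 / 6]]
post = class_posteriors([0, 1, 4], class_probs, word_probs)
predicted = max(range(len(post)), key=post.__getitem__)
```

Here the test document is dominated by the third word, which the second class favors, so `predicted` is index 1 despite that class's lower prior.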
In the semi-supervised setting with labeled and unlabeled data, we would still like
to find MAP parameter estimates, as in the supervised setting above. Because there
are no labels for the unlabeled data, the closed-form equations from the previous
section are not applicable. However, using the EM technique, we can find locally
MAP parameter estimates for the generative model.
The EM technique as applied to the case of labeled and unlabeled data with
naive Bayes yields a straightforward and appealing algorithm. First, a naive Bayes
classifier is built in the standard supervised fashion from the limited amount of
labeled training data. Then, we perform classification of the unlabeled data with
the naive Bayes model, noting not the most likely class but the probabilities
associated with each class. Then, we rebuild a new naive Bayes classifier using all the
data—labeled and unlabeled—using the estimated class probabilities as true class
labels. This means that the unlabeled documents are treated as several fractional
documents according to these estimated class probabilities. We iterate this process
of classifying the unlabeled data and rebuilding the naive Bayes model until it
converges to a stable classifier and set of labels for the data. This is summarized in
algorithm 3.1.
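$$\begin{aligned} l(\theta \mid X, Y) ={}& \log P(\theta) + \sum_{x_i \in X_u} \log \sum_{j \in [M]} P(c_j \mid \theta)\, P(x_i \mid c_j; \theta) \\ &+ \sum_{x_i \in X_l} \log \bigl( P(y_i = c_j \mid \theta)\, P(x_i \mid y_i = c_j; \theta) \bigr). \end{aligned} \tag{3.8}$$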
(We have dropped the constant terms for convenience.) Notice that this equation
contains a log of sums for the unlabeled data, which makes a maximization by
partial derivatives computationally intractable. The formalism of EM (Dempster
et al., 1977) provides an iterative hill-climbing approach to finding local maxima
of model probability in parameter space. The E step of the algorithm estimates
the expectations of the missing values (i.e., unlabeled class information) given the
latest iteration of the model parameters. The M step maximizes the likelihood of
the model parameters using the previously computed expectations of the missing
values as if they were the true ones.
In practice, the E step corresponds to performing classification of each unlabeled
document using Eq. 3.7. The M step corresponds to calculating a new maximum a
posteriori (MAP) estimate for the parameters, θ̂, using Eqs. 3.5 and 3.6 with the
current estimates for P(cj |xi ; θ̂).
Essentially all initializations of the parameters lead to some local maximum with
EM. Many instantiations of EM begin by choosing a starting model parameterization
randomly. In our case, we can be more selective about the starting point
since we have not only unlabeled data but also some labeled data. Our iteration
process is initialized with a priming M step, in which only the labeled documents
are used to estimate the classifier parameters, θ̂, as in Eqs. 3.5 and 3.6. Then the
cycle begins with an E step that uses this classifier to probabilistically label the
unlabeled documents for the first time.
The algorithm iterates until it converges to a point where θ̂ does not change
from one iteration to the next. Algorithmically, we determine that convergence
has occurred by observing a below-threshold change in the log-probability of the
parameters (Eq. 3.8), which is the height of the surface on which EM is hill-
climbing.
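The loop described in this section (priming M step on labeled data, then alternating E and M steps until the log-probability of Eq. 3.8 stops improving) might look like the sketch below. All function names, the tolerance, the iteration cap, and the toy documents are assumptions for illustration; the log prior is written for the case αt = 2, where it reduces to a sum of log parameters up to a constant.

```python
import math

def log_joints(x, class_probs, word_probs):
    """log P(c_j) + sum_t x_t * log P(w_t | c_j) for every class j."""
    return [math.log(pc) + sum(n * math.log(pw) for n, pw in zip(x, pws))
            for pc, pws in zip(class_probs, word_probs)]

def e_step(x, class_probs, word_probs):
    """Class posteriors for one document, computed in log space."""
    lj = log_joints(x, class_probs, word_probs)
    top = max(lj)
    unnorm = [math.exp(v - top) for v in lj]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def m_step(docs, resp, num_classes):
    """Smoothed ratio-of-counts estimates from fractional memberships resp[i][j]."""
    vocab = len(docs[0])
    class_probs = [(1 + sum(r[j] for r in resp)) / (num_classes + len(docs))
                   for j in range(num_classes)]
    word_probs = []
    for j in range(num_classes):
        counts = [sum(r[j] * x[t] for x, r in zip(docs, resp)) for t in range(vocab)]
        word_probs.append([(1 + c) / (vocab + sum(counts)) for c in counts])
    return class_probs, word_probs

def log_posterior(labeled, labels, unlabeled, class_probs, word_probs):
    """Eq. 3.8 up to an additive constant, assuming all alpha_t = 2."""
    value = sum(math.log(p) for p in class_probs)
    value += sum(math.log(p) for row in word_probs for p in row)
    for x in unlabeled:  # log of a sum over classes
        lj = log_joints(x, class_probs, word_probs)
        top = max(lj)
        value += top + math.log(sum(math.exp(v - top) for v in lj))
    for x, y in zip(labeled, labels):  # known class: a single term
        value += log_joints(x, class_probs, word_probs)[y]
    return value

def em(labeled, labels, unlabeled, num_classes, max_iters=100, tol=1e-8):
    one_hot = [[1.0 if y == j else 0.0 for j in range(num_classes)] for y in labels]
    class_probs, word_probs = m_step(labeled, one_hot, num_classes)  # priming M step
    previous = -math.inf
    for _ in range(max_iters):
        soft = [e_step(x, class_probs, word_probs) for x in unlabeled]  # E step
        class_probs, word_probs = m_step(labeled + unlabeled, one_hot + soft,
                                         num_classes)  # M step on all data
        current = log_posterior(labeled, labels, unlabeled, class_probs, word_probs)
        if current - previous < tol:  # below-threshold change: stop
            break
        previous = current
    return class_probs, word_probs

# Toy data (made up): two labeled and four unlabeled documents.
labeled, labels = [[3, 0, 0], [0, 0, 3]], [0, 1]
unlabeled = [[2, 1, 0], [4, 0, 0], [0, 1, 2], [0, 0, 4]]
class_probs, word_probs = em(labeled, labels, unlabeled, num_classes=2)
```

Labeled documents keep their one-hot memberships throughout, while each unlabeled document contributes fractional counts to every class in proportion to its current posterior.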
3.2.3 Discussion
The justifications for this approach depend on the assumptions stated in section 3.2,
namely, that the data are produced by a mixture model, and that there is a one-
to-one correspondence between mixture components and classes. If the generative
modeling assumptions were correct, then maximizing model probability would be
a good criterion indeed for training a classifier. In this case the Bayes optimal
classifier, when the number of training examples approaches infinity, corresponds
to the MAP parameter estimates of the model. When these assumptions do not
hold—as certainly is the case in real-world textual data—the benefits of unlabeled
data are less clear. With only labeled data, the naive Bayes classifier does a good job
of classifying text documents (Lewis and Ringuette, 1994; Craven et al., 2000; Yang
and Pedersen, 1997; Joachims, 1997; McCallum et al., 1998). This observation is
explained in part by the fact that classification estimation is only a function of the
sign (in binary classification) of the function estimation (Domingos and Pazzani,
1997; Friedman, 1997). The faulty word independence assumption exacerbates the
tendency of naive Bayes to produce extreme (almost 0 or 1) class probability
estimates. However, classification accuracy can be quite high even when these
estimates are inappropriately extreme.
Semi-supervised learning leans more heavily on the correctness of the modeling
assumptions than supervised learning. The next section will show empirically that
this method can indeed dramatically improve the accuracy of a document classifier,
especially when there are only a few labeled documents.
3.3 Experimental Results with Basic EM
[Figure 3.1 plot: accuracy (0% to 100%) versus number of labeled documents (10 to 5000, log scale), with one curve for 10000 unlabeled documents and one for no unlabeled documents.]
Figure 3.1 Classification accuracy on the 20 Newsgroups data set, both with and without
10,000 unlabeled documents. With small amounts of training data, using EM yields more
accurate classifiers. With large amounts of labeled training data, accurate parameter
estimates can be obtained without the use of unlabeled data, and classification accuracies
of the two methods begin to converge.
[Figure 3.2 plot: accuracy (0% to 100%) versus log probability of model (-1.69e+07 to -1.64e+07).]
Figure 3.2 A scatterplot showing the correlation between the posterior model probability
and the accuracy of a model trained with labeled and unlabeled data. The strong
correlation implies that model probability is a good optimization criterion for the
20 Newsgroups data set.
vertical axis indicates average classifier accuracy on test sets, and the horizontal axis
indicates the amount of labeled training data on a log scale. We vary the amount of
labeled training data, and compare the classification accuracy of traditional naive
Bayes (no unlabeled documents) with an EM learner that has access to 10,000
unlabeled documents.
EM performs significantly better than traditional naive Bayes. For example,
with 300 labeled documents (15 documents per class), naive Bayes reaches 52%
accuracy while EM achieves 66%. This represents a 30% reduction in classification
error. Note that EM also performs well even with a very small number of labeled
documents; with only 20 documents (a single labeled document per class), naive
Bayes obtains 20%, EM 35%. As expected, when there are a lot of labeled data,
and the naive Bayes learning curve is close to a plateau, having unlabeled data
does not help nearly as much, because there are already enough labeled data to
accurately estimate the classifier parameters. With 5500 labeled documents (275
per class), classification accuracy increases from 76% to 78%. Each of these results
is statistically significant (p < 0.05).⁴
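As a quick check of the error-reduction figure quoted for the 300-document case, the arithmetic can be written out using only the accuracies stated in the text (52% for naive Bayes, 66% for EM):

$$\text{error}_{\text{NB}} = 1 - 0.52 = 0.48, \qquad \text{error}_{\text{EM}} = 1 - 0.66 = 0.34, \qquad \frac{0.48 - 0.34}{0.48} \approx 0.29,$$

which is consistent with the reported reduction of about 30% once rounding of the accuracies is taken into account.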
How does EM find more accurate classifiers? It does so by optimizing on posterior
model probability, not classification accuracy directly. If our generative model were
perfect, then we would expect model probability and accuracy to be correlated and
EM to be helpful. But we know that our simple generative model does not capture
many of the properties contained in the text. Our 20 Newsgroups results show that
we do not need a perfect model for EM to help text classification. Generative models
are representative enough for the purposes of text classification if model probability
and accuracy are correlated, allowing EM to indirectly optimize accuracy.
4. When the number of labeled examples is small, we have multiple trials, and use paired
t-tests. When the number of labeled examples is large, we have a single trial, and report
results instead with a McNemar test. These tests are discussed further by Dietterich (1998).
To illustrate this more definitively, let us look again at the 20 Newsgroups
experiments, and empirically measure this correlation. Figure 3.2 demonstrates the
correlation—each point in the scatterplot is one of the labeled and unlabeled splits
from figure 3.1. The labeled data here are used only for setting the EM initialization
and are not used during iterations. We plot classification performance as accuracy
on the test data and show the posterior model probability.
For this data set, classification accuracy and model probability are in good
correspondence. The correlation coefficient between accuracy and model probability
is 0.9798, a very strong correlation indeed. We can take this as a post hoc verification
that this data set is amenable to using unlabeled data via a generative model
approach. The optimization criterion of model probability is applicable here because
it is in tandem with accuracy.
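The correlation check described here amounts to computing a Pearson coefficient over (log model probability, test accuracy) pairs, one pair per trained model. The sketch below shows the computation; the four data points are invented purely to exercise the function and are not the chapter's measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values: log model probabilities and test accuracies.
log_probs = [-1.69e7, -1.68e7, -1.67e7, -1.66e7]
accuracies = [0.30, 0.42, 0.55, 0.61]
r = pearson(log_probs, accuracies)
```

A coefficient near +1 would support using model probability as a proxy objective for accuracy, while a coefficient near -1 would be a warning that maximizing model probability could hurt classification.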
The second assumption of the generative model of section 3.2 states that there
is a one-to-one correspondence between classes and components in the mixture
model. In some text domains, it is clear that such an assumption is a dangerous
one. Consider the task of text filtering, where we want to identify a small well-
defined class of documents from a very large pool or stream of documents. One
example might be a system that watches a network administrator’s incoming emails
to identify the rare emergency situation that would require paging her on vacation.
Modeling the nonemergency emails as the negative class with only one multinomial
distribution will result in an unrepresentative model. The negative class contains
emails with a variety of subtopics: personal emails, nonemergency requests, spam,
and many more.
What would be a more representative model? Instead of modeling a sea of
negative examples with a single mixture component, it might be better to model
it with many components. In this way, each negative component could, after
maximization, capture one clump of the sea of examples. This section takes exactly
the approach suggested by this example for text data, and relaxes the assumption of
a one-to-one correspondence between mixture components and classes. We replace it
with a less restrictive assumption: a many-to-one correspondence between mixture
components and classes. This allows us to model the subtopic structure of a class.
The new generative model must account for a many-to-one correspondence between
mixture components and classes. As in the old model, we first pick a class with a
biased die roll. Each class has several subtopics; we next pick one of these subtopics,
again with a biased die roll. Now that the subtopic is determined, the document’s
words are generated. We do this by first picking a length (independently of subtopic
and class) and then draw the words from the subtopic’s multinomial distribution.
Unlike previously, there are now two missing values for each unlabeled document:
its class and its subtopic. Even for the labeled data there are missing values;
although the class is known, its subtopic is not. Since we do not have access to
these missing class and subtopic labels, we must use a technique such as EM to
estimate local MAP generative parameters. As in section 3.2.2, EM is instantiated
as an iterative algorithm that alternates between estimating the values of missing
class and subtopic labels, and calculating the MAP parameters using the estimated
labels. After EM converges to high-probability parameter estimates the generative
model can be used for text classification by turning it around with the Bayes rule.
The new generative model specifies a separation between mixture components
and classes. Instead of using cj to denote both of these, cj ∈ [N ] now denotes only
the jth mixture component (subtopic). We write ta ∈ [M ] for the ath class; when
component cj belongs to class ta , then qaj = 1, and otherwise 0. This represents the
predetermined, deterministic, many-to-one mapping between mixture components
and classes. We indicate the class label and subtopic label of a document by yi and
zi , respectively. Thus if document xi was generated by mixture component cj we
say zi = cj , and if the document belongs to class ta , then we say yi = ta .
If all the class and subtopic labels were known for our data set, finding MAP
estimates for the generative parameters would be a straightforward application of
closed-form equations similar to those for naive Bayes seen in section 3.2.1. The
formula for the word probability parameters is identical to Eq. 3.5 for naive Bayes:
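$$\hat\theta_{w_t|c_j} \equiv P(w_t \mid c_j; \hat\theta) = \frac{1 + \sum_{x_i \in X} \delta_{ij}\, x_{it}}{|X| + \sum_{s=1}^{|X|} \sum_{x_i \in X} \delta_{ij}\, x_{is}}. \tag{3.9}$$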
The class probabilities are analogous to Eq. 3.6, but using the new notation for
classes instead of components:
$$\hat\theta_{t_a} \equiv P(t_a \mid \hat\theta) = \frac{1 + \sum_{i=1}^{|X|} \delta_{ia}}{M + |X|}. \tag{3.10}$$
The subtopic probabilities are similar, except they are estimated only with reference
to other documents in that component’s class:
$$\hat\theta_{c_j|t_a} \equiv P(c_j \mid t_a; \hat\theta) = \frac{1 + \sum_{i=1}^{|X|} \delta_{ij}\, \delta_{ia}}{\sum_{j=1}^{N} q_{aj} + \sum_{i=1}^{|X|} \delta_{ia}}. \tag{3.11}$$
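$$P(z_i = c_j \mid x_i; \hat\theta) = \frac{\sum_{a \in [M]} q_{aj}\, P(t_a \mid \hat\theta)\, P(c_j \mid t_a; \hat\theta) \prod_{w_t \in X} P(w_t \mid c_j; \hat\theta)^{x_{it}}}{\sum_{r \in [N]} \sum_{b \in [M]} q_{br}\, P(t_b \mid \hat\theta)\, P(c_r \mid t_b; \hat\theta) \prod_{w_t \in X} P(w_t \mid c_r; \hat\theta)^{x_{it}}}. \tag{3.12}$$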
Overall class membership is calculated with a sum of probability over all of the
class’s subtopics:
$$P(y_i = t_a \mid x_i; \hat\theta) = \sum_{j \in [N]} q_{aj}\, P(z_i = c_j \mid x_i; \hat\theta). \tag{3.13}$$
These equations for supervised learning are applicable only when all the training
documents have both class and subtopic labels. Without these we use EM. The
M step, as with basic EM, builds maximum a posteriori parameter estimates for
the multinomials and priors. This is done with Eqs. 3.9, 3.10, and 3.11, using the
probabilistic class and subtopic memberships estimated in the previous E step. In
the E step, for the unlabeled documents we calculate probabilistically weighted
subtopic and class memberships (Eqs. 3.12 and 3.13). For labeled documents, we
must estimate subtopic membership. But we know from its given class label that
many of the subtopic memberships must be zero, namely those of subtopics that
belong to other classes. Thus we calculate subtopic memberships as for the unlabeled
data, but setting the appropriate ones to zero, and normalizing the nonzero ones over
only those subtopics that belong to its class.
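The E step for the multi-component model can be sketched as below, following Eqs. 3.12 and 3.13 and the masking rule for labeled documents. The representation is an assumption of this sketch: the many-to-one mapping q is stored as a list `comp_class`, where `comp_class[j]` is the class that component j belongs to, and all parameter values are invented.

```python
import math

def subtopic_posteriors(x, comp_class, class_probs, comp_probs, word_probs, label=None):
    """P(z = c_j | x); if a class label is given, other classes' subtopics get zero."""
    log_joint = []
    for j, a in enumerate(comp_class):
        if label is not None and a != label:
            log_joint.append(-math.inf)  # subtopic belongs to another class
            continue
        # log P(t_a) + log P(c_j | t_a) + sum_t x_t * log P(w_t | c_j)
        value = math.log(class_probs[a]) + math.log(comp_probs[j])
        value += sum(n * math.log(p) for n, p in zip(x, word_probs[j]))
        log_joint.append(value)
    top = max(log_joint)
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def class_posteriors(z_post, comp_class, num_classes):
    """Eq. 3.13: sum subtopic posteriors over the subtopics of each class."""
    out = [0.0] * num_classes
    for j, a in enumerate(comp_class):
        out[a] += z_post[j]
    return out

# Hypothetical model: class 0 has one subtopic, class 1 has two.
comp_class = [0, 1, 1]
class_probs = [0.2, 0.8]       # P(t_a)
comp_probs = [1.0, 0.5, 0.5]   # P(c_j | class of c_j)
word_probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
x = [0, 3, 1]
z_unlabeled = subtopic_posteriors(x, comp_class, class_probs, comp_probs, word_probs)
y_unlabeled = class_posteriors(z_unlabeled, comp_class, num_classes=2)
z_labeled = subtopic_posteriors(x, comp_class, class_probs, comp_probs, word_probs, label=1)
```

For the labeled call, the first component's membership is exactly zero and the remaining mass is renormalized over the two subtopics of class 1.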
If we are given a set of class-labeled data, and a set of unlabeled data, we can now
apply EM if there is some specification of the number of subtopics for each class.
However, this information is not typically available. As a result we must resort to
some technique for model selection. There are many commonly used approaches
to model selection, such as cross-validation, the Akaike information criterion (AIC),
the Bayesian information criterion (BIC), and others. Since we have a limited
number of labeled documents available, we use cross-validation to select the
number of subtopics that gives the best classification performance.
Table 3.1 Classification accuracy of binary classifiers on Reuters with traditional naive
Bayes (NB1), basic EM (EM1) with labeled and unlabeled data, multiple mixture components
using just labeled data (NB*), and multiple mixture components EM with labeled
and unlabeled data (EM*). For NB* and EM*, the number of components is selected
optimally for each trial, and the median number of components across the trials used for
the negative class is shown in parentheses. Note that the multicomponent model is more
natural for Reuters, where the negative class consists of many topics. Using both unlabeled
data and multiple mixture components per class increases performance over either alone,
and over naive Bayes.
Here, we provide empirical evidence that to use unlabeled data with a generative
modeling approach, more expressive generative models are sometimes necessary.
With the original generative model, classification accuracy and model probability
can be negatively correlated, leading to lower classification accuracy when unlabeled
data are used. With a more expressive generative model, a moderate positive
correlation is achieved, leading to improved classification accuracies.
The Reuters 21578 Distribution 1.0 data set consists of about 13,000 news articles
from the Reuters newswire labeled with 90 topic categories. Documents in this
data set have multiple class labels, and each category is traditionally evaluated
with a binary classifier. Following several other studies (Joachims, 1998; Liere and
Tadepalli, 1997) we build binary classifiers for each of the ten most populous classes
to identify the topic. We use a stoplist, but do not stem. The vocabulary size for
each Reuters trial is selected by optimizing accuracy as measured by leave-one-out
cross-validation on the labeled training set. The standard ModApte train/test split
is used, which is time-sensitive. Seven thousand of the 9603 documents available
for training are left unlabeled. From the remaining, we randomly select up to
ten nonoverlapping training sets of just ten positively labeled documents and 40
negatively labeled documents.
The first two columns of results in table 3.1 repeat the experiments of section 3.3
with basic EM on the Reuters data set. Here we see that for most categories,
classification accuracy decreases with the introduction of unlabeled data. For each
of the Reuters categories EM finds a significantly more probable model, given the
evidence of the labeled and unlabeled data. But frequently this more probable model
corresponds to a lower-accuracy classifier, which is not what we would hope for.
[Figure 3.3 plots: two scatterplots of accuracy versus log probability of model. Left: accuracy 85% to 95%, log probability -2.121e+06 to -2.117e+06. Right: accuracy 85% to 100%, log probability -1.99e+06 to -1.96e+06.]
Figure 3.3 Scatterplots showing the relationship between model probability and
classification accuracy for the Reuters acq task. On the left, with only one mixture component
for the negative class, probability and accuracy are negatively correlated, exactly what
we would not want. On the right, with ten mixture components for negative, there is a
moderate positive correlation between model probability and classification accuracy.
The first graph in figure 3.3 provides insight into why unlabeled data hurt. With
one mixture component per class, the correlation between classification accuracy
and model probability is very strong (r = −0.9906), but in the wrong direction!
Models with higher probability have significantly lower classification accuracy. By
examining the solutions found by EM, we find that the most probable clustering of
the data has one component with the majority of negative documents and the second
with most of the positive documents, but significantly more negative documents.
Thus, the classes do not separate with high-probability models.
The documents in this data set often have multiple class labels. With the basic
generative model, the negative class covers up to 89 distinct categories. Thus, it is
unreasonable to expect to capture such a broad base of text with a single mixture
component. For this reason, we relax the generative model and model the positive
class with a single mixture component and the negative class with between one and
forty mixture components, both with and without unlabeled data.
The second half of table 3.1 shows results of using a generative model with multiple
mixture components per class. Note two different results. First, with labeled data alone (NB*),
classification accuracy improves over the single component per class case (NB1).
Second, with unlabeled data, the new generative model results (EM*) are generally
better than the other results. This increase with unlabeled data, measured over all
trials of Reuters, is statistically significant (p < 0.05).
With ten mixture components the correlation between accuracy and model
probability is quite different. Figure 3.3 on the right shows the correlation between
accuracy and model probability when using ten mixture components to model the
negative class. Here, there is a moderate correlation between model probability
and classification accuracy in the right direction (r = 0.5474). For these solutions,
one component covers nearly all the positive documents and some, but not many,
negatives. The other ten components are distributed through the remaining negative
documents. This model is more representative of the data for our classification task
because classification accuracy and model probability are correlated. This allows
the beneficial use of unlabeled data through the generative model approach.
Table 3.2 Performance of using multiple mixture components when the number of
components is selected via cross-validation (EM*CV) compared to the optimal selection
(EM*) and straight naive Bayes (NB1). Note that cross-validation usually selects too few
components.
One obvious question is how to automatically select the best number of mixture
components without having access to the test set labels. We use leave-one-out cross-
validation. Results from this technique (EM*CV), compared to naive Bayes (NB1)
and the best EM (EM*), are shown in table 3.2. Note that cross-validation does
not perfectly select the number of components that perform best on the test set.
The results consistently show that selection by cross-validation chooses a smaller
number of components than is best.
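Leave-one-out selection of the number of components can be sketched generically as below. The function `select_num_components` and its `fit_predict` callback are hypothetical names; in practice the callback would train the multi-component model (with EM) on the held-in documents and classify the held-out one. The stub used here is only a stand-in to exercise the loop and is not a real classifier.

```python
def select_num_components(docs, labels, candidates, fit_predict):
    """Return (best_k, accuracy) using leave-one-out accuracy on the labeled data."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        correct = 0
        for i in range(len(docs)):
            # Hold out document i; train on the rest.
            train_docs = docs[:i] + docs[i + 1:]
            train_labels = labels[:i] + labels[i + 1:]
            if fit_predict(train_docs, train_labels, docs[i], k) == labels[i]:
                correct += 1
        accuracy = correct / len(docs)
        if accuracy > best_acc:  # strict '>' keeps the smallest k among ties
            best_k, best_acc = k, accuracy
    return best_k, best_acc

# Hypothetical stand-in: pretends the model only classifies correctly when k == 3.
def stub_fit_predict(train_docs, train_labels, test_doc, k):
    return test_doc[0] if k == 3 else 0

docs = [[0, 5], [1, 2], [1, 7], [0, 1]]   # first entry doubles as the true label
labels = [0, 1, 1, 0]
best_k, best_acc = select_num_components(docs, labels, [1, 3, 5], stub_fit_predict)
```

Breaking ties toward the smaller candidate is a design choice of this sketch; it would reinforce the tendency, noted above, for cross-validation to choose fewer components than is best on the test set.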
3.4.3 Discussion
There is tension in this model selection process between complexity of the model
and data sparsity. With as many subtopics as there are documents, we can perfectly
model the training data—each subtopic covers one training document. With still a
large number of subtopics, we can accurately model existing data, but generalization
performance will be poor. This is because each multinomial will have its many
parameters estimated from only a few documents and will suffer from sparse
data. With very few subtopics, the opposite problem will arise. We will very
accurately estimate the multinomials, but the model will be overly restrictive,
3.5 Overcoming the Challenges of Local Maxima
In cases where the likelihood in parameter space is well correlated with classification
accuracy, our optimization yields good classifiers. However, local maxima
significantly hinder our progress. For example, the local maxima we discover with
just a few labeled examples in section 3.3 are more than 40 percentage points below
the classification accuracy provided when labeled data are plentiful. Thus it is important
to consider alternative approaches that can help bridge this gap, especially
when labeled data are sparse.
Typically variants of, or alternatives to, EM are created for the purpose of
speeding up the rate of convergence (McLachlan and Krishnan, 1997). In the domain
of text classification, however, we have seen that convergence is very fast. Thus, we
can easily consider alternatives to EM that improve the local maxima situation
at the expense of slower convergence. Deterministic annealing makes exactly this
tradeoff.
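$$\begin{aligned} l(\theta \mid X, Y) ={}& \sum_{x_i \in X_u} \log \sum_{c_j \in [M]} \bigl[ P(c_j \mid \theta)\, P(x_i \mid c_j; \theta) \bigr]^{\beta} \\ &+ \sum_{x_i \in X_l} \log \Bigl( \bigl[ P(y_i = c_j \mid \theta)\, P(x_i \mid y_i = c_j; \theta) \bigr]^{\beta} \Bigr), \end{aligned} \tag{3.14}$$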
where β varies between zero and one. When β = 1 we have our familiar probability
surface of the previous sections, with good correlation to classification accuracy,
but with many harmful local maxima. In the limit as β approaches zero, the
surface value of the loss function in parameter space becomes convex with just
a single global maximum. But, at this extreme, the provided data have no effect
on the loss function, so the correlation with classification accuracy is poor. Values
between zero and one represent various points in the tradeoff between smoothness
of the parameter space and the similarity to the well-correlated probability surface
provided by the data.
This insight is the one that drives the approach called deterministic annealing
(Rose et al., 1992), first used as a way to construct a hierarchy during unsupervised
clustering. It has also been used to estimate the parameters of a mixture of
Gaussians from unlabeled data (Ueda and Nakano, 1995) and to construct a text
hierarchy from unlabeled data (Hofmann and Puzicha, 1998).
For a fixed value of β, we can find a local maximum given the loss function by
iterating the following steps:
E step: Calculate the expected value of the class assignments,
M step: Find the most likely model using the expected class assignments,
The M step is identical to that of section 3.2.2, while the E step includes reference
to the loss constraint through β.
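One way to read the β-dependent E step, assuming it raises each class's joint probability to the power β and renormalizes (as the form of Eq. 3.14 suggests), is sketched below. The function name `annealed_posteriors` and the parameter values are illustrative assumptions.

```python
import math

def annealed_posteriors(x, class_probs, word_probs, beta):
    """Class memberships proportional to [P(c_j) P(x | c_j)]^beta."""
    log_joint = [math.log(pc) + sum(n * math.log(pw) for n, pw in zip(x, pws))
                 for pc, pws in zip(class_probs, word_probs)]
    scaled = [beta * v for v in log_joint]  # exponent beta becomes a factor in log space
    top = max(scaled)
    unnorm = [math.exp(v - top) for v in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical parameters for 2 classes and a 3-word vocabulary.
class_probs = [0.6, 0.4]
word_probs = [[4 / 9, 4 / 9, 1 / 9], [1 / 6, 1 / 6, 4 / 6]]
x = [0, 1, 4]
sharp = annealed_posteriors(x, class_probs, word_probs, beta=1.0)
soft = annealed_posteriors(x, class_probs, word_probs, beta=0.1)
flat = annealed_posteriors(x, class_probs, word_probs, beta=0.0)
```

At β = 1 this is the ordinary E step; as β shrinks the memberships flatten toward uniform, which is how the data lose their influence on the surface at high temperature.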
Formally, β is a Lagrange multiplier when solving for a fixed loss in the likelihood
space subject to an optimization criterion of maximum entropy (or minimum
relative entropy to the prior distribution). A β near zero corresponds to finding
the maximum entropy parameterization for a model with a very large allowable
loss.
Consider how model likelihood (Eq. 3.14) is affected by different target losses.
When the target loss is very large, β will be very close to zero; the probability of
each model will very nearly be its prior probability as the influence of the data will
be negligible. In the limit as β goes to zero, the probability surface will be convex
with a single global maximum. For a somewhat smaller loss target, β will be small
but not negligible. Here, the probability of the data will have a stronger influence.
There will no longer be a single global maximum, but several. When β = 1 we have
our familiar probability surface of the previous chapters, with many local maxima.
These observations suggest an annealing-like process for finding a low-loss model.
If we initialize β to be very small, we can easily find the global maximum a posteriori
solution with EM, as the surface is convex. When we raise β the probability surface
will get slightly more bumpy and complex, as the data likelihood will have a larger
impact on the probability of the model. Although more complex, the new maximum
will be very close to the old maximum if we have lowered the temperature (1/β) only
slightly. Thus, when searching for the maximum with EM, we can initialize it with
the old maximum and will converge to a good maximum for the new probability
surface. In this way, we can gradually raise β, while tracking a highly probable
solution. Eventually, when β becomes 1, we will have a good local maximum for
our generative model assumptions. Thus, we will have found a high-probability local
maximum from labeled and unlabeled data that we can then use for classification.
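The annealing loop described above can be sketched in a few lines. This is a minimal
illustration under stated assumptions, not the authors' implementation: the E step uses a
tempered posterior (class scores raised to the power β), the M step uses Laplace-smoothed
counts for a mixture of multinomials, the parameters start out uniform, and all function
names are hypothetical.

```python
import math

def tempered_posteriors(x, prior, word, beta):
    """E step for one unlabeled document x (a list of word counts)."""
    n_classes = len(prior)
    # beta scales the log of P(c_j) * P(x | c_j) for every class (assumed form)
    scores = [
        beta * (math.log(prior[j])
                + sum(c * math.log(word[j][w]) for w, c in enumerate(x) if c))
        for j in range(n_classes)
    ]
    top = max(scores)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def m_step(docs, resp, n_classes, vocab_size):
    """Laplace-smoothed estimates from expected assignments (assumed M step)."""
    prior, word = [], []
    for j in range(n_classes):
        mass = sum(r[j] for r in resp)
        prior.append((1.0 + mass) / (n_classes + len(docs)))
        counts = [1.0] * vocab_size
        for x, r in zip(docs, resp):
            for w, c in enumerate(x):
                counts[w] += r[j] * c
        norm = sum(counts)
        word.append([c / norm for c in counts])
    return prior, word

def deterministic_annealing(docs, labels, n_classes, vocab_size,
                            beta0=0.02, growth=1.01):
    """Run one EM iteration per temperature, raising beta until it reaches 1."""
    prior = [1.0 / n_classes] * n_classes
    word = [[1.0 / vocab_size] * vocab_size for _ in range(n_classes)]
    beta = beta0
    while True:
        resp = []
        for x, y in zip(docs, labels):
            if y is None:
                resp.append(tempered_posteriors(x, prior, word, beta))
            else:
                # labeled documents keep their known class
                resp.append([1.0 if j == y else 0.0 for j in range(n_classes)])
        prior, word = m_step(docs, resp, n_classes, vocab_size)
        if beta >= 1.0:
            return prior, word
        beta = min(1.0, beta * growth)
```

Each pass reuses the previous parameters as the starting point, which is the "tracking"
behavior described in the text; a label of None marks an unlabeled document.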
Note that the computational cost of deterministic annealing is significantly higher
than that of EM. While each iteration takes the same computation, there are many more
iterations with deterministic annealing, as the temperature is reduced very slowly.
For example, in our experiments, we performed 390 iterations for deterministic
annealing, and only seven for EM. When this extra computation can be afforded,
the benefit may be more accurate classifiers.
In this section we see empirically that deterministic annealing finds more probable
parameters and more accurate classifiers than EM when labeled training data are
sparse.
For the experimental results, we use the News5 data set, a subset of 20 Newsgroups
containing the five confusable comp.* classes. We fix a single vocabulary for all
experiments as the top 4000 words as measured by mutual information over the
entire labeled data set. For running the deterministic annealing, we initialize β to
0.02, and at each iteration we increase β by a multiplicative factor of 1.01 until
β = 1. We made little effort to tune these parameters. Since each time we increase
β the probability surface changes only slightly, we run only one iteration of EM
at each temperature setting. Six hundred random documents per class (3000 total)
are treated as unlabeled. A fixed number of labeled examples per class are also
randomly selected. The remaining documents are used as a test set.
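The temperature schedule just described is easy to enumerate. The sketch below assumes
that the last step is capped at exactly β = 1 (the text only says β is raised "until
β = 1"); the function name is illustrative.

```python
def beta_schedule(beta0=0.02, growth=1.01):
    """Return every temperature setting visited, ending exactly at beta = 1."""
    betas = []
    beta = beta0
    while beta < 1.0:
        betas.append(beta)
        beta *= growth
    betas.append(1.0)  # assumed cap: the final setting is beta = 1
    return betas

schedule = beta_schedule()
print(len(schedule))  # number of temperature settings, one EM iteration each
```

With these settings the schedule has just under 400 entries, the same order of magnitude
as the iteration count quoted earlier for deterministic annealing.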
Figure 3.4 compares classification accuracy achieved with deterministic annealing
to that achieved by regular EM. The initial results indicate that the two methods
perform essentially the same when labeled data are plentiful, but deterministic
annealing actually performs worse when labeled data are sparse. For example, with
two labeled examples per class (ten total) EM gives 58% accuracy where
deterministic annealing gives only 51%. A close investigation of the confusion matrices
shows that there is a significant detrimental effect of incorrect class-to-component
correspondence with deterministic annealing when labeled data are sparse. This
occurs because, when the temperature is very high, the global maximum will have
each multinomial mixture component very close to its prior, and the influence of
the labeled data is minimal. Since the priors are the same, each mixture component
will be essentially identical. As the temperature lowers and the mixture
components become more distinct, one component can easily track the cluster associated
with the wrong class, when there are insufficient labeled data to pull it toward the
correct class.
[Figure 3.4: plot of accuracy (0% to 100%) against the number of labeled documents
(5 to 2000); image not reproduced.]
In an attempt to remedy this, we alter the class-to-cluster correspondence based
on the classification of each labeled example after deterministic annealing is com-
plete. Figure 3.4 shows both the accuracy obtained by empirically selected corre-
spondence, and also the optimal accuracy achieved by perfect correspondence. We
see that by empirically setting the correspondence, deterministic annealing improves
accuracy only marginally. Where before it got 51%, by changing the correspondence
we increase this to 55%, still not better than EM at 58%. However, if we could
perform perfect class correspondence, accuracy with deterministic annealing would be
67%, considerably higher than EM.
To verify that the higher accuracy of deterministic annealing comes from finding
more probable models, figure 3.5 shows a scatterplot of model probability versus
accuracy for deterministic annealing (with optimal class assignment) and EM. Two
results of note stand out. The first is that indeed deterministic annealing finds much
more probable models, even with a small amount of labeled data. This accounts
for the added accuracy of deterministic annealing. A second note of interest is
that models found by deterministic annealing still lie along the same
probability-accuracy correlation line. This provides further evidence that model probability and
accuracy are strongly correlated for this data set, and that the correlation is not
just an artifact of EM.
Figure 3.5 A scatterplot comparing the model probabilities and accuracies of EM and
deterministic annealing (accuracy plotted against log probability of model). The results
show that deterministic annealing succeeds because it finds models with significantly
higher probability.
3.5.3 Discussion
The experimental results show that deterministic annealing indeed could help clas-
sification considerably if class-to-component correspondence were solved. Determin-
istic annealing successfully avoids getting trapped in some poor local maxima and
instead finds more probable models. Since these high-probability models are cor-
related with high-accuracy classifiers, deterministic annealing makes good use of
unlabeled data for text classification.
The class-correspondence problem is most severe when there are only limited
labeled data. This is because with fewer labeled examples, it is more likely that
small perturbations can lead the correspondence astray. However, with just a
little bit of human knowledge, the class-correspondence problem can typically be
solved trivially. In all but the largest and most confusing classification tasks, it is
straightforward to identify a class given its most indicative words, as measured by
a metric such as the weighted log-likelihood ratio. For example, the top ten words
per class of our data set by this metric are shown in table 3.3. From just these ten
words, any person with even the slightest bit of domain knowledge would have no
problem perfectly assigning classes to components. Thus, it is not unreasonable to
require a small amount of human effort to correct the class correspondence after
deterministic annealing has finished. This effort can be positioned within the active
learning framework. Thus, when labeled training data are sparsest, and a modest
investment by a trainer is available to map class labels to cluster components,
deterministic annealing will successfully find more probable and more accurate
models than traditional EM.
Table 3.3 The top ten words per class of the News5 data set, Usenet groups in the
comp hierarchy. The words are sorted by the weighted log-likelihood ratio. Note that from
just these ten top words, any person with domain knowledge could correctly match
clusters to classes.
graphics   os.ms-windows.misc   sys.ibm.pc.hardware   sys.mac.hardware   windows.x
jpeg       windows              scsi                  apple              window
image      ei                   ide                   mac                widget
graphics   win                  drive                 lc                 motif
images     um                   controller            duo                xterm
gif        dos                  bus                   nubus              server
format     ms                   dx                    fpu                lib
pub        ini                  bios                  centris            entry
ray        microsoft            drives                quadra             openwindows
tiff       nt                   mb                    iisi               usr
siggraph   el                   card                  powerbook          sun
Even when this limited domain knowledge or human effort is not available, it
should be possible to estimate the class correspondence automatically. One could
perform both EM and deterministic annealing on the data. Since EM solutions
generally have the correct class correspondence, this model could be used to fix the
correspondence of the deterministic annealing model. That is, one could measure the
distance between each EM class multinomial and each deterministic annealing class
multinomial (with Kullback-Leibler divergence, for example). Then, this matrix of
distances could be used to assign the class labels of the EM multinomials to their
closest match to a multinomial in the deterministic annealing model.
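The matching procedure just outlined can be sketched as follows. This is an illustration
under assumptions, not a method evaluated in the chapter: it uses the Kullback-Leibler
divergence from each EM class multinomial to each annealed component, requires the
assignment to be one-to-one, and picks the assignment with the smallest total divergence
by exhaustive search. The function names are hypothetical.

```python
import math
from itertools import permutations

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def match_components(em_classes, da_components):
    """Return a tuple perm where perm[c] is the annealed component for EM class c."""
    n = len(em_classes)
    # dist[c][k]: divergence from EM class c to annealed component k
    dist = [[kl_divergence(em_classes[c], da_components[k]) for k in range(n)]
            for c in range(n)]
    # Exhaustive search over one-to-one assignments; fine for a handful of classes.
    return min(permutations(range(n)),
               key=lambda perm: sum(dist[c][perm[c]] for c in range(n)))

em = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
da = [[0.15, 0.1, 0.75], [0.65, 0.25, 0.1], [0.1, 0.75, 0.15]]
print(match_components(em, da))  # (1, 2, 0)
```

Exhaustive search is practical for the five classes of News5 (120 assignments); for many
classes a dedicated assignment algorithm would be the more sensible choice. Smoothed
multinomials, as produced by the M step, keep every divergence finite.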
3.6 Conclusions and Summary
This chapter has explored the use of generative models for semi-supervised
learning with labeled and unlabeled data in domains of text classification. The widely
used naive Bayes classifier for supervised learning defines a mixture of
multinomials model. In some domains, model likelihood and classification
accuracy are strongly correlated, despite the overly simplified generative model. Here,
expectation-maximization finds more likely models and improved classification
accuracy. In other domains, likelihood and accuracy are not well correlated with the
naive Bayes model. Here, we can use a more expressive generative model that allows
for multiple mixture components per class. This helps restore a moderate
correlation between model likelihood and classification accuracy, and again, EM finds
more accurate models. Finally, even with a well-correlated generative model, local
maxima are a significant hindrance with EM. Here, the approach of deterministic
annealing does provide much higher likelihood models, but often loses the
correspondence with the class labels. When class label correspondence is easily corrected,
high accuracy models result.
4 Risks of Semi-Supervised Learning: How
Unlabeled Data Can Degrade Performance
of Generative Classifiers
Empirical and theoretical results have often testified favorably to the semi-
supervised learning of generative classifiers, as described in other chapters of this
book. However, the literature has also brought to light a number of situations
where semi-supervised learning fails to produce good generative classifiers. Here
some clarification is due. We are not simply concerned with classifiers that pro-
duce high classification error — this can also happen in supervised learning. Our
concern is this: it is frequently the case that we would be better off just discard-
ing the unlabeled data and employing a supervised method, rather than taking a
semi-supervised route. Thus we worry about the embarrassing situation where the
addition of unlabeled data degrades the performance of a classifier.
How can this be? Typically we do not expect to be better off by discarding data;
how can we understand this aspect of semi-supervised learning? In this chapter we
focus on the effect of modeling errors in semi-supervised learning, and show how
modeling errors can lead to performance degradation.
Venkatesh (1995).
The gist of these previous theoretical investigations is this. Suppose samples
(xi , yi ) are realizations of random variables Xv and Yv that are distributed according
to distribution p(Xv , Yv ). Suppose one learns a parametric model p(Xv , Yv |θ) such
that p(Xv , Yv |θ) is equal to p(Xv , Yv ) for some value of θ; that is, the “model is
correct” in the sense that it can exactly represent p(Xv , Yv ) (see footnote 1). Then
one is assured to have an expected reduction in classification error as more and more
data are collected (labeled or unlabeled). Moreover, labeled data are exponentially more
effective in reducing classification error than unlabeled data. In these optimistic
results, unlabeled data can be profitably used whenever available.
However, a more detailed analysis of current empirical results does reveal some
puzzling aspects of unlabeled data. For example, Shahshahani and Landgrebe
(1994) report experiments where unlabeled data degraded the performance of naive
Bayes classifiers with Gaussian variables. They attribute such cases to deviations
from modeling assumptions, such as outliers and “samples of unknown classes”;
they even suggest that unlabeled samples should be used with care, and only
when the labeled data alone produce a poor classifier. Another representative
example is the work by Nigam et al. (2000) on text classification, where classifiers
sometimes display performance degradation. They suggest several possible sources
of difficulties, including numerical problems in the learning algorithm and mismatches
between the natural clusters in feature space and the actual labels. Additional examples
are easy to find. Baluja (1999) used naive Bayes and tree-augmented naive Bayes
(TAN) classifiers (Friedman et al., 1997) to detect faces in images, but there were
cases where unlabeled data degraded performance. Bruce (2001) used labeled and
unlabeled data to learn Bayesian network classifiers, from naive Bayes classifiers
to fully connected networks; the naive Bayes classifiers displayed bad classification
performance, and in fact the performance degraded as more unlabeled data were
used (more complex networks also displayed performance degradation as unlabeled
samples were added). A final example: Grandvalet and Bengio (2004) describe
experiments where outliers are added to a Gaussian model, causing generative
classifiers to degrade with unlabeled data.
Figure 4.1 shows a number of experiments that corroborate this anecdotal
evidence. All of them involve binary classification with categorical variables; in all
of them Xv is actually a vector containing several attributes Xvi . In all experiments
the generative classifiers were learned by maximum likelihood using the
expectation-maximization (EM) algorithm (chapters 2, 3). Figure 4.1(a) shows the performance
of naive Bayes classifiers learned with increasing amounts of unlabeled data (for
several fixed amounts of labeled data), where the data are distributed according to
naive Bayes assumptions. That is, the data were sampled from randomly generated
statistical models that comply with the independence assumptions of naive Bayes
classifiers. In the naive Bayes model, all attributes Xvi are independent of each
other given the class Yv : $p(X_v, Y_v) = p(Y_v) \prod_i p(X_{vi} \mid Y_v)$. The result
is simple: the more unlabeled data, the better. Figure 4.1(b) shows an entirely
different picture. Here a series of naive Bayes classifiers were learned with data
distributed according to TAN assumptions: each attribute is directly dependent on the
class and on at most one other attribute; the attributes form a “tree” of dependencies,
hence the name tree-augmented naive Bayes (Friedman et al., 1997). That is, in figure
4.1(b) the “model is incorrect.” The graphs in figure 4.1(b) indicate performance
degradation with increasing amounts of unlabeled data.
1. Note that here and in the remainder of the chapter we employ p to denote distributions
and densities (for discrete/continuous variables using appropriate measures); we indicate
the type of object we deal with whenever it is not clear from the context.
Figure 4.1(c) depicts a more complex scenario. Again a series of naive Bayes
classifiers were learned with data distributed according to TAN assumptions, so
the “model is incorrect.” Note that two of the graphs show a trend of decreasing
error (as the number of unlabeled samples increases), while the other graph shows a
trend of increasing error. Here unlabeled data improve performance in the presence
of a few labeled samples, but unlabeled data degrade performance when added to
a larger number of labeled samples. A larger set of experiments with artificial data
is described by Cozman and Cohen (2002).
Figure 4.1(d) shows the result of learning naive Bayes classifiers using different
combinations of labeled and unlabeled data sets for the adult classification problem
(using the training and testing data sets available in the UCI repository; see
footnote 2). We see that adding unlabeled data can improve classification when the
labeled data set is small (30 labeled samples), but degrade performance as the labeled
data set becomes larger. Thus the properties of this real data set lead to behavior
similar to figure 4.1(c).
Finally, figures 4.1(e) and 4.1(f) show the results of learning naive Bayes and TAN
classifiers using data set 8 in the benchmark data (chapter 21). Both show trends
similar to those displayed in the previous graphs.
4.2 Understanding Unlabeled Data: Asymptotic Bias
We can summarize the previous section as follows. First, there are results that
guarantee benefits from unlabeled data when the learned generative classifier
is based on a “correct” model. Second, there is strong empirical evidence that
unlabeled data may degrade performance of classifiers. Performance degradation
may occur whenever the modeling assumptions adopted for a particular classifier
do not match the characteristics of the distribution generating the data.3 This is
2. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult
3. As we show in this and subsequent sections, performance degradation occurs even in
the absence of numerical errors or existence of local optima for parameter estimation.
In fact our presentation is independent of numerical techniques, so that results are not
clouded by the intricacies of numerical analysis.
[Figure 4.1, panels (a) to (f): probability of error plotted against the number of
unlabeled records (10^0 to 10^4), with curves marked 30 Labeled, 300 Labeled, and
3000 Labeled; plots not reproduced.]
Figure 4.1 (a) Naive Bayes classifiers learned from data distributed according to naive
Bayes assumptions with ten attributes; attributes with two to four values. (b) Naive Bayes
classifiers learned from data distributed according to TAN assumptions with ten attributes.
(c) Naive Bayes classifiers learned from data distributed according to TAN assumptions
with 49 attributes. (d) Naive Bayes classifiers generated from the adult database. (e) Naive
Bayes classifiers generated from the data set SecStr, benchmark data (chapter 21). (f)
TAN classifiers generated from the data set SecStr, benchmark data (chapter 21). In all
graphs, points summarize ten runs of each classifier on testing data (bars cover 30th to
70th percentiles).
Note that from the above distribution we can compute the probabilities of W given
G to get
To classify the baby’s gender given weight gain and chocolate craving, we compute
the a posteriori probability of G given W and Ch (which, from the independence
stated above, depends only on Ch):
From the a posteriori probabilities, the optimal classification rule (the Bayes rule,
discussed in the next section) is
The Bayes error rate (i.e., the probability of error under the Bayes rule) for this
problem can be easily computed and is about 15%.
Suppose that we incorrectly assume a naive Bayes model for the problem; that is,
we assume that dependencies are expressed by the graph Ch ← G → W . Thus we
incorrectly assume that weight gain is independent of chocolate craving given the
gender, so that the joint probability distribution factorizes as
P(G, Ch, W ) = P(G)P(Ch|G)P(W |G). Suppose that
a friend gave us the “true” values of P(Ch|G), so we do not have to estimate
these quantities. We wish to estimate P(G) and P(W |G) using maximum-likelihood
techniques.
In the case where only labeled data are available, estimators are obtained by
relative frequencies, with zero bias and variance inversely proportional to the
size of the database. Thus even a relatively small database will produce excellent
estimates of probability values. The estimate for P(G) will most likely be close to
0.5; likewise, estimates of P(W = Less|G = Girl) will be close to 0.6 and estimates
of P(W = Less|G = Boy) will be close to 0.25. With these estimated parameters
and the assumed decomposition of the joint probability distribution, the a posteriori
probabilities for G will likely be close to
maximum a posteriori value of G. Even though the bias from the “true” a posteriori
probabilities is not zero, this will produce the same optimal Bayes rule 4.1; that is,
the “labeled” classifier is likely to yield the minimum classification error.
Now suppose that unlabeled data are available. As more and more unlabeled
samples are collected, the ratio between the number of labeled samples and the
total number of samples goes to zero. In section 4.3 we show how to compute the
asymptotic estimates in this case. The computation, which is performed in closed
form for this case, yields the following asymptotic estimates: P(G = Boy) = 0.5,
P(W = Less|G = Girl) = 0.78, P(W = Less|G = Boy) = 0.07. The a posteriori
probabilities for G will therefore tend to
Here we see that the prediction has changed from the optimal in the case {Ch =
Yes, W = Less}; instead of predicting {G = Boy}, we predict {G = Girl}. We can
easily find the expected error rate to be about 22%, an increase of 7 percentage
points in error.
What happened? The labeled data take us to a particular asymptotic limit, and
the unlabeled data take us to a distinct limit. In section 4.3 we will see that this
transition is smooth as unlabeled samples are collected. Because the latter limit is
worse (from the point of view of classification) than the former, the gradual addition
of unlabeled samples degrades performance.
Consider again figure 4.1(a). The graphs there illustrate the situation where
the “model is correct”: labeled and unlabeled data lead to identical asymptotic
estimates. The other graphs in figure 4.1 illustrate situations where the “model is
incorrect.” In these cases the asymptotic estimates tend to the “unlabeled” classifier
as more and more unlabeled data are available — depending on the amount of
labeled data, the graphs start above or below this “unlabeled” limit.
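\[
\arg\max_{\theta} \; \lambda E_{p(X_v, Y_v)}\left[ \log p(X_v, Y_v \mid \theta) \right]
+ (1 - \lambda) E_{p(X_v, Y_v)}\left[ \log p(X_v \mid \theta) \right]. \tag{4.2}
\]
where p(Xv ) is a mixture density. Accordingly, the parametric model adopted for
(Xv , Ỹv ) has the same form:
\[
\tilde{p}(X_v, \tilde{Y}_v = y \mid \theta) =
\left( \lambda\, p(X_v, Y_v = y \mid \theta) \right)^{I_{\{\tilde{Y}_v \neq 0\}}(y)}
\left( (1 - \lambda)\, p(X_v \mid \theta) \right)^{I_{\{\tilde{Y}_v = 0\}}(y)}.
\]
The value θ∗ that maximizes $E_{\tilde{p}(X_v, \tilde{Y}_v)}[\log \tilde{p}(X_v, \tilde{Y}_v \mid \theta)]$ is
\[
\arg\max_{\theta} E_{\tilde{p}(X_v, \tilde{Y}_v)}\left[
I_{\{\tilde{Y}_v \neq 0\}}(\tilde{Y}_v) \log\left( \lambda\, p(X_v, Y_v \mid \theta) \right)
+ I_{\{\tilde{Y}_v = 0\}}(\tilde{Y}_v) \log\left( (1 - \lambda)\, p(X_v \mid \theta) \right)
\right].
\]
Hence θ∗ maximizes
\[
\beta + E_{\tilde{p}(X_v, \tilde{Y}_v)}\left[ I_{\{\tilde{Y}_v \neq 0\}}(\tilde{Y}_v) \log p(X_v, Y_v \mid \theta) \right]
+ E_{\tilde{p}(X_v, \tilde{Y}_v)}\left[ I_{\{\tilde{Y}_v = 0\}}(\tilde{Y}_v) \log p(X_v \mid \theta) \right],
\]
where β = λ log λ + (1 − λ) log(1 − λ). As β does not depend on θ, we must only
maximize the last two terms, which are equal to
$\lambda E_{\tilde{p}(X_v, \tilde{Y}_v)}[\log p(X_v, Y_v \mid \theta) \mid \tilde{Y}_v \neq 0]
+ (1 - \lambda) E_{\tilde{p}(X_v, \tilde{Y}_v)}[\log p(X_v \mid \theta) \mid \tilde{Y}_v = 0]$.
As we have $\tilde{p}(X_v, \tilde{Y}_v \mid \tilde{Y}_v \neq 0) = p(X_v, Y_v)$ and
$\tilde{p}(X_v \mid \tilde{Y}_v = 0) = p(X_v)$, the last expression is equal to
$\lambda E_{p(X_v, Y_v)}[\log p(X_v, Y_v \mid \theta)] + (1 - \lambda) E_{p(X_v, Y_v)}[\log p(X_v \mid \theta)]$.
Thus we obtain expression 4.2.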
Results by White (1982) can also be adapted to the context of semi-supervised
learning to prove that generally the variance of estimates decreases with increasing
n. The asymptotic variance depends on the inverse of the Fisher information; the
Fisher information is typically larger for larger proportions of labeled data (Castelli,
1994; Castelli and Cover, 1995, 1996).
Expression 4.2 indicates that the objective function in semi-supervised learning
can be viewed asymptotically as a “convex” combination of objective functions
for supervised learning (E [log p(Xv , Yv |θ)]) and for unsupervised learning
(E [log p(Xv |θ)]). Denote by θλ∗ the value of θ that maximizes expression 4.2 for
a given λ. Denote by θl∗ the “labeled” limit θ1∗ and by θu∗ the “unlabeled” limit
θ0∗ (see footnote 5). We note that, with a few additional assumptions on the modeling
densities, theorem 4.1 and the implicit function theorem can be used to prove that θλ∗
is a continuous function of λ; that is, the “path” followed by the solution is a
continuous one.
We can now present more formal versions of the arguments sketched in section 4.2.
Suppose first that the family of distributions p(Xv , Yv |θ) contains the distribution
p(Xv , Yv ); that is, p(Xv , Yv |θ⊤ ) = p(Xv , Yv ) for some θ⊤ , so the “model is correct.”
When such a condition is satisfied, θl∗ = θu∗ = θ⊤ given identifiability, and then
θλ∗ = θ⊤ , for any 0 < λ ≤ 1, is a maximum-likelihood estimate. In this case,
maximum likelihood is consistent, the asymptotic bias is zero, and classification
error converges to the Bayes error. As variance decreases with increasing numbers
of labeled and unlabeled data, the addition of both kinds of data eventually reaches
the “correct” distribution and the Bayes error.
We now study the scenario that is more relevant to our purposes, where the
distribution p(Xv , Yv ) does not belong to the family of distributions p(Xv , Yv |θ).
Denote by e(θ) the classification error with parameter θ, and suppose e(θu∗ ) > e(θl∗ )
(as in the Boy-Girl example and in the other examples presented later). If we observe
a large number of labeled samples, the classification error is approximately e(θl∗ ).
If we then collect more samples, most of which are unlabeled, we eventually reach
a point where the classification error approaches e(θu∗ ). So, the net result is that
we started with a classification error close to e(θl∗ ), and by adding a great number
of unlabeled samples, classification performance degraded towards e(θu∗ ). A labeled
data set can be dwarfed by a much larger unlabeled data set: the classification
error using the whole data set can be larger than the classification error using only
labeled data.
To summarize, we have the following conclusions. First, labeled and unlabeled
data contribute to a reduction in variance in semi-supervised learning under
maximum-likelihood estimation. Second, when the model is “correct,” maximum-
likelihood methods are asymptotically unbiased both with labeled and unlabeled
data. Third, when the model is “incorrect,” there may be different asymptotic bi-
ases for different values of λ. Asymptotic classification error may also vary with λ
— an increase in the number of unlabeled samples may lead to a larger estimation
asymptotic bias and to a larger classification error. If the performance obtained
with a given set of labeled data is better than the performance with infinitely many
unlabeled samples, then at some point the addition of unlabeled data must decrease
performance.
5. We have to handle a difficulty with the classification error for θu∗ : given only unlabeled
data, there is no information to decide the labels for decision regions, and the classification
error is 1/2 (Castelli, 1994). Thus we always reason with λ → 0 instead of λ = 0.
4.4 The Value of Labeled and Unlabeled Data
The previous discussion alluded to the possibility that e(θu∗ ) > e(θl∗ ) when the model
is “incorrect.” To understand a few important details about this phenomenon,
consider another example.
Suppose we have attributes Xv1 and Xv2 from two classes −1 and +1. We know
that (Xv1 , Xv2 ) is a Gaussian vector with mean (0, 3/2) conditional on {Yv = −1},
and mean (3/2, 0) conditional on {Yv = +1}; variances for Xv1 and for Xv2
conditional on Yv are equal to 1. We believe that Xv1 and Xv2 are independent
given Yv , but actually Xv1 and Xv2 are dependent conditional on {Yv = +1}:
the correlation ρ = E [(Xv1 − E [Xv1 |Yv = +1])(Xv2 − E [Xv2 |Yv = +1])|Yv = +1]
is equal to 4/5 (Xv1 and Xv2 are independent conditional on {Yv = −1}). Data
are sampled from a distribution such that η = P(Yv = −1) = 3/5, but we do not
know this probability. If we knew the value of ρ and η, we would easily compute the
optimal classification boundary on the plane Xv1 × Xv2 (this optimal classification
boundary is quadratic). By mistakenly assuming that ρ is zero we are generating a
naive Bayes classifier that approximates P(Yv |Xv1 , Xv2 ).
Under the incorrect assumption that ρ = 0, the “optimal” classification boundary
is linear: xv2 = xv1 + 2 log((1 − η̂)/η̂)/3. With labeled data we can easily obtain
η̂ (a sequence of Bernoulli trials); then ηl∗ = 3/5 and the classification boundary
is given by xv2 = xv1 − 0.27031. Note that this (linear) boundary obtained with
labeled data and the generative naive Bayes classifier assumption is not the best
possible linear boundary minimizing the classification error. We can in fact find the
best possible linear boundary of the form xv2 = xv1 + γ. The classification error
can be written as a function of γ that has positive second derivative; consequently
the function has a single minimum that can be found numerically (the minimizing
γ is −0.45786). If we consider the set of lines of the form xv2 = xv1 + γ, we see that
the farther we go from the best line, the larger the classification error. Figure 4.2
shows the linear boundary obtained with labeled data and the best possible linear
boundary. The boundary from labeled data is “above” the best linear boundary.
Now consider the computation of ηu∗ , the asymptotic estimate with unlabeled
data. By theorem 4.1, we must obtain:
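\[
\arg\max_{\eta \in [0,1]} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
g_0(x_{v1}, x_{v2}) \log\left( \eta\, g_1(x_{v1}, x_{v2}) + (1 - \eta)\, g_3(x_{v1}, x_{v2}) \right)
dx_{v2}\, dx_{v1},
\]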
where
Figure 4.2 Graphs for the Gaussian example. On the left, contour plots of the mixture
p(Xv1 , Xv2 ), the optimal classification boundary (quadratic curve), and the best possible
classification boundary of the form xv2 = xv1 + γ. On the right, the same contour plots,
and the best linear boundary (lower line), the linear boundary obtained from labeled data
(middle line), and the linear boundary obtained from unlabeled data (upper line).
The second derivative of this double integral with respect to η is always negative
(as can be seen by interchanging differentiation with integration), so the function is
concave and there is a single maximum. We can search for the zero of the derivative
of the double integral with respect to η. We obtain this value numerically,
ηu∗ = 0.54495. Using this estimate, the linear boundary from unlabeled data is
xv2 = xv1 − 0.12019.
This line is “above” the linear boundary from labeled data, and, given the previous
discussion, leads to a larger classification error than the boundary from labeled
data. The boundary obtained from unlabeled data is also shown in figure 4.2. The
classification error for the best linear boundary is 0.06975, while e(ηl∗ ) = 0.07356
and e(ηu∗ ) = 0.08141.
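The boundary offsets and the three error values can be checked with a short calculation.
The sketch below assumes that a boundary xv2 = xv1 + γ assigns class −1 to points above
the line, so that the error depends only on the difference Xv2 − Xv1, which is Gaussian
within each class (mean 3/2 and variance 2 for class −1; mean −3/2 and variance 2 − 2ρ
for class +1). The function names are illustrative and not from the text.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def boundary_offset(eta_hat):
    """Offset gamma of the naive Bayes boundary x2 = x1 + gamma."""
    return 2.0 * math.log((1.0 - eta_hat) / eta_hat) / 3.0

def linear_error(gamma, eta=0.6, rho=0.8):
    """Error of the rule 'predict -1 when x2 - x1 > gamma' under the true model."""
    # Class -1 is misclassified when x2 - x1 falls below gamma.
    miss_neg = normal_cdf((gamma - 1.5) / math.sqrt(2.0))
    # Class +1 is misclassified when x2 - x1 exceeds gamma.
    miss_pos = 1.0 - normal_cdf((gamma + 1.5) / math.sqrt(2.0 - 2.0 * rho))
    return eta * miss_neg + (1.0 - eta) * miss_pos

def best_offset(lo=-1.5, hi=1.5, iters=60):
    """Golden-section search for the gamma that minimizes linear_error."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if linear_error(a) < linear_error(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

for name, gamma in [("best", best_offset()),
                    ("labeled", boundary_offset(0.6)),
                    ("unlabeled", boundary_offset(0.54495))]:
    print(name, round(gamma, 5), round(linear_error(gamma), 5))
```

The search interval (−3/2, 3/2) is chosen because the error is convex in γ there, which
is what makes a simple bracketing search sufficient.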
This example suggests the following situation. Suppose we collect a large number
l of labeled samples from P(Yv , Xv1 , Xv2 ), with η = 3/5 and ρ = 4/5. The labeled
estimates form a sequence of Bernoulli trials with probability 3/5, so the estimates
quickly approach ηl∗ (the variance of η̂ decreases as 6/(25l)). If we then add a very
large amount of unlabeled data to our data, η̂ approaches ηu∗ and the classification
error increases.
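The variance figure quoted above follows from the variance of a Bernoulli sample mean, η(1 − η)/l = (3/5)(2/5)/l = 6/(25l). The sketch below checks this by simulation; the function names, the number of trials, and the seed are illustrative assumptions, not part of the original analysis.

```python
import random

def labeled_eta_estimate(l, eta, rng):
    """Fraction of l simulated labels that fall in the class with prior eta."""
    return sum(rng.random() < eta for _ in range(l)) / l

def empirical_variance(l, eta=0.6, trials=2000, seed=0):
    """Monte Carlo variance of the labeled-data estimate of eta (assumed setup)."""
    rng = random.Random(seed)
    estimates = [labeled_eta_estimate(l, eta, rng) for _ in range(trials)]
    mean = sum(estimates) / trials
    return sum((e - mean) ** 2 for e in estimates) / trials

def theoretical_variance(l, eta=0.6):
    """eta * (1 - eta) / l, which equals 6 / (25 l) when eta = 3/5."""
    return eta * (1 - eta) / l
```

For l = 100 the theoretical value is 0.0024, and the simulated variance should land close to it.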
By changing the values of η and ρ, we can produce other interesting situations. For
example, if η = 3/5 and ρ = −4/5, the best linear boundary is xv2 = xv1 − 0.37199,
the boundary from labeled data is xv2 = xv1 − 0.27031, and the boundary from
unlabeled data is xv2 = xv1 − 0.34532; the latter boundary is “between” the other
two — additional unlabeled data lead to improvement in classification performance!
As another example, if η = 3/5 and ρ = −1/5, the best linear boundary is
xv2 = xv1 − 0.29044, the boundary from labeled data is xv2 = xv1 − 0.27031,
and the boundary from unlabeled data is xv2 = xv1 − 0.29371. The best linear
boundary is “between” the other two. In this case we attain the best possible linear
boundary by mixing labeled and unlabeled data with λ = 0.08075.
4.5 Finite Sample Effects
We have so far found that taking larger and larger amounts of unlabeled data
changes not only the variance of estimates but also their average behavior. The
Gaussian example shows that we cannot always expect labeled data to produce a
better classifier than the unlabeled data. Still, one would intuitively expect labeled
data to provide more guidance to a learning procedure than unlabeled data. Is there
anything that can be said about the (intuitively plausible and empirically visible)
more valuable status of labeled data?
One informal argument is this. Suppose we have an estimate θ̂. It is typically the
case that the smaller the value of the expected Kullback-Leibler divergence between
p(Yv |Xv ) and p(Yv |Xv , θ̂), the smaller the classification error, where the
Kullback-Leibler divergence is EKL(θ) = E [log(p(Yv |Xv )/p(Yv |Xv , θ))] (Garg and Roth,
2001; Cover and Thomas, 1991). Direct minimization of expected Kullback-Leibler
divergence yields EKL(θt∗ ) where θt∗ = arg maxθ E [log p(Yv |Xv , θ)]. Now unlabeled
data asymptotically yield EKL(θu∗ ) where θu∗ = arg maxθ E [log p(Xv |θ)], and
labeled data asymptotically yield EKL(θl∗ ) where θl∗ = arg maxθ E [log p(Yv |Xv , θ)] +
E [log p(Xv |θ)]. Note the following pattern. We are interested in maximizing
E [log p(Yv |Xv , θ)], which is the same as minimizing EKL(θ). While labeled data
allow us to maximize a combination of this quantity plus E [log p(Xv |θ)], unlabeled
data only allow us to maximize E [log p(Xv |θ)]. When the “model is incorrect,” this
last quantity may in fact be far from the “true” E [log p(Xv )], and we may be getting
less help from unlabeled data than we might get from labeled data. This informal
argument seems to be at the core of the perception that labeled data should be more
valuable than unlabeled data when the “model is incorrect.” The analysis presented
in this chapter adds to this perception the following comment: by trying to
(asymptotically) maximize an expected value E [log p(Xv |θ)] that may even be
unrelated to the “true” E [log p(Xv )], we may in fact be led astray by the unlabeled data.
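The link between the expected Kullback-Leibler divergence and the expected log-likelihoods in this argument can be written out in two lines. The derivation below uses only the definitions given in the text, with expectations taken under the true distribution; it is a sketch of the reasoning, not an additional result.

```latex
\begin{align*}
\mathrm{EKL}(\theta)
  &= E\!\left[\log \frac{p(Y_v \mid X_v)}{p(Y_v \mid X_v, \theta)}\right]
   = \underbrace{E[\log p(Y_v \mid X_v)]}_{\text{does not depend on } \theta}
     - E[\log p(Y_v \mid X_v, \theta)], \\
E[\log p(Y_v, X_v \mid \theta)]
  &= E[\log p(Y_v \mid X_v, \theta)] + E[\log p(X_v \mid \theta)].
\end{align*}
```

The first line shows that minimizing EKL(θ) is the same as maximizing E[log p(Yv|Xv, θ)], which gives θt∗. The second line is the objective that labeled data maximize asymptotically (θl∗), while unlabeled data only see its last term (θu∗).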
Asymptotic analysis can provide insight into complex phenomena, but finite sample
effects are also important. In practice one may have very little labeled data, and the
estimates θ̂ from labeled data may be so poor that the addition of unlabeled data is
a positive move. This can be explained as follows. A small number of labeled samples
may lead to estimators with high variance, thus likely to yield high classification
error (Friedman, 1997). In those circumstances the inclusion of unlabeled data may
lead to a substantial decrease in variance and a decrease in classification error, even
as the bias is negatively affected by the unlabeled data.
In general, the more parameters one has to estimate, the larger the variance of
estimators for the same amount of data. If we have a classifier with a large number
of attributes and we have only a few labeled samples, the variance of estimators is
likely to be large, and classification performance is likely to be poor; the addition
of unlabeled data is then a reasonable action to take. Consider again figure 4.1(c).
Here we have a naive Bayes classifier with 49 attributes. If we have a relatively large
amount of labeled data, we start close to the “labeled” limit e(θl∗ ), and then we
observe performance degradation as we move toward e(θu∗ ). However, if we have few
labeled samples, we start with very poor performance, and we decrease classification
error by moving toward e(θu∗ ).
We note that text classification is an important problem where many attributes
are available (often thousands), and where generative semi-supervised
learning has been successful (Nigam et al., 2000).
labeled data degrades performance, then there is clear indication that modeling
assumptions are incorrect. In fact one can test whether differences in performance
are statistically significant, using results by O’Neill (1978); once one finds that a
particular set of modeling assumptions is flawed, a healthy process of model revision
may be started. In fact, one might argue that model search/revision should always
be an important component in the tool set of semi-supervised learning (Cozman
et al., 2003a).
4.7 Conclusion
Given the possibility of performance degradation, it seems that some care must be
taken in generative semi-supervised learning. Statements that are intuitively and
provably true when models are “correct” may fail (sometimes miserably!) when
models are “incorrect.” Apparently mild modeling errors may cause unlabeled data
to degrade performance, even in the absence of numerical errors, and even in sit-
uations where more labeled data would be beneficial. Examples of performance
degradation from outliers and other common modeling errors can be easily con-
cocted (Cozman et al., 2003b).
In the absence of modeling errors, labeled data differ from unlabeled data only
on the “information they carry about the decisions associated with the decision
regions” (Castelli and Cover, 1995). However, as we consider the possibility of
modeling errors, labeled data and unlabeled data also differ in the biases they induce
on estimates. The analysis in sections 4.2, 4.3, and 4.4 focused on asymptotic bias,
a strategy that avoids distractions from finite sample effects and numerical errors.
However, we note that finite sample effects may be important in practice, as we
discuss in section 4.5.
At this point it is perhaps useful to add a few comments of methodological
character. Given a pool of labeled and unlabeled data, generative semi-supervised
learning is an attractive strategy. However, one should always start by learning a
supervised classifier with the labeled data. This “baseline” classifier can then be
compared to other semi-supervised classifiers through cross-validation or similar
techniques. Whenever modeling assumptions seem inaccurate, unlabeled data can
be used to test modeling assumptions. If time and resources are available, a model
search should be conducted, attempting to reach a “correct” model — that is, a
model where unlabeled data will be truly beneficial. Techniques discussed in section
4.6 can be employed in this setting. An additional step is to compare the baseline
classifier to nongenerative methods. There are many semi-supervised nongenerative
classifiers, as discussed in other chapters of this book. There are also a significant
number of methods that use labeled and unlabeled data for different purposes — for
example, methods where the unlabeled data are used only to conduct dimensionality
reduction (chapter 12). However we should warn that a few empirical results in the
literature suggest the possibility of performance degradation in nongenerative semi-
supervised learning paradigms, such as transductive support vector machine (SVM)
72 Risks of Semi-Supervised Learning
5.1 Introduction
This chapter focuses on semi-supervised clustering with constraints, the problem of
partitioning a set of data points into a specified number of clusters when limited
supervision is provided in the form of pairwise constraints. While clustering is
traditionally considered to be a form of unsupervised learning since no class labels
are given, inclusion of pairwise constraints makes it a semi-supervised learning task,
where the performance of unsupervised clustering algorithms can be improved using
the limited training data.
Pairwise supervision is typically provided as must-link and cannot-link constraints
on data points: a must-link constraint indicates that both points in the pair should
be placed in the same cluster, while a cannot-link constraint indicates that the two
points in the pair should belong to different clusters. Alternatively, must-link
and cannot-link constraints are sometimes called equivalence and nonequivalence
constraints respectively. Typically, the constraints are “soft”, that is, clusterings
that violate them are undesirable but not prohibited.
In certain applications, supervision in the form of class labels may be unavailable,
while pairwise constraints are easily obtained, creating the need for methods that
exploit such supervision. For example, complete class labels may be unknown in
the context of clustering for speaker identification in a conversation (Bar-Hillel
et al., 2003), or clustering GPS data for lane-finding (Wagstaff et al., 2001). In
some domains, pairwise constraints occur naturally, e.g., the database of interacting
proteins (DIP) data set in biology contains information about proteins co-occurring
in processes, which can be viewed as must-link constraints during clustering.
Moreover, in an interactive learning setting, a user who is not a domain expert can
sometimes provide feedback in the form of must-link and cannot-link constraints
more easily than class labels, since providing constraints does not require the user
to have significant prior knowledge about the categories in the data set.
Proposed methods for semi-supervised clustering fall into two general categories
that we call constraint-based and distance-based. Constraint-based methods use the
provided supervision to guide the algorithm toward a data partitioning that avoids
violating the constraints (Demiriz et al., 1999; Wagstaff et al., 2001; Basu et al.,
2002). In distance-based approaches, an existing clustering algorithm that uses
a particular distance function between points is employed; however, the distance
function is parameterized and the parameter values are learned to bring must-linked
points together and move cannot-linked points farther apart (Bilenko and Mooney,
2003; Cohn et al., 2003; Klein et al., 2002; Xing et al., 2003).
This chapter describes an approach to semi-supervised clustering based on hidden
Markov random fields (HMRFs) that combines the constraint-based and distance-
based approaches in a unified probabilistic model. The probabilistic formulation
leads to a clustering objective function derived from the joint probability of ob-
served data points, their cluster assignments, and generative model parameters.
This objective function can be optimized using an expectation-maximization
5.2 HMRF Model for Semi-Supervised Clustering
is to partition the data points X into K disjoint clusters (X1 , . . . , XK ) so that the
total distortion between the points and the corresponding cluster representatives is
minimized according to the given distortion measure dA , while constraint violations
are kept to a minimum.
The HMRF probabilistic framework (Zhang et al., 2001) for semi-supervised con-
strained clustering consists of the following components:
An observable set X = (x1 , . . . , xn ) corresponding to the given data points X.
Note that we overload notation and use X to refer to both the given set of data
points and their corresponding random variables.
An unobservable (hidden) set Y = (y1 , . . . , yn ) corresponding to cluster assign-
ments of points in X. Each hidden variable yi encodes the cluster label of the point
xi and takes values from the set of cluster indices (1, . . . , K).
An unobservable (hidden) set of generative model parameters Θ, which consists
of distortion measure parameters A and cluster representatives M = (μ1 , . . . , μK ):
Θ = {A, M }.
An observable set of constraint variables C = (c12 , c13 , . . . , cn−1,n ). Each cij
is a ternary variable taking on a value from the set (−1, 0, 1), where cij = 1
indicates that (xi , xj ) ∈ CML , cij = −1 indicates that (xi , xj ) ∈ CCL , and cij = 0
corresponds to pairs (xi , xj ) that are not constrained.
Since constraints are fully observed and the described model does not attempt
to model them generatively, the joint probability of X, Y , and Θ is conditioned on
the constraints encoded by C.
Figure 5.1 shows a simple example of an HMRF. X consists of five data points
with corresponding variables (x1 , . . . , x5 ) that have cluster labels Y = (y1 , . . . , y5 ),
which may each take on values (1, 2, 3) denoting the three clusters. Three pairwise
constraints are provided: two must-link constraints (x1 , x2 ) and (x1 , x4 ), and one
cannot-link constraint (x2 , x3 ). Corresponding constraint variables are c12 = 1,
c14 = 1, and c23 = −1; all other variables in C are set to zero. The task is to
partition the five points into three clusters. Figure 5.1 demonstrates one possible
clustering configuration which does not violate any constraints. The must-linked
points x1 , x2 , and x4 belong to cluster 1; the point x3 , which is cannot-linked with
x2 , is assigned to cluster 2; x5 , which is not involved in any constraints, belongs to
cluster 3.
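The example above can be encoded directly to check that the described clustering violates no constraint. This is a minimal sketch; the dictionary layout and the function name `violated_constraints` are illustrative assumptions, with constraints keyed by point indices and valued +1 (must-link) or −1 (cannot-link), matching the cij convention.

```python
def violated_constraints(labels, constraints):
    """Return the constrained pairs that the labeling violates.

    labels: dict mapping point index -> cluster index
    constraints: dict mapping (i, j) -> +1 (must-link) or -1 (cannot-link)
    """
    violated = []
    for (i, j), c in constraints.items():
        same_cluster = labels[i] == labels[j]
        if (c == 1 and not same_cluster) or (c == -1 and same_cluster):
            violated.append((i, j))
    return violated

# The five-point example: c12 = 1, c14 = 1, c23 = -1, all other cij = 0.
constraints = {(1, 2): 1, (1, 4): 1, (2, 3): -1}
# The configuration shown in the figure: x1, x2, x4 -> 1; x3 -> 2; x5 -> 3.
labels = {1: 1, 2: 1, 3: 2, 4: 1, 5: 3}
```

Calling `violated_constraints(labels, constraints)` on this configuration returns an empty list; moving x3 into cluster 1 would violate the cannot-link constraint (x2, x3).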
[Figure 5.1: a hidden MRF layer with labels y1 = 1, y2 = 1, y3 = 2, y4 = 1, y5 = 3, joined by must-link edges (c12 = 1, c14 = 1) and a cannot-link edge (c23 = −1), above the observed data layer x1 , . . . , x5 .]
CCL }. The resulting random field defined over the hidden variables Y is a Markov
random field (MRF), where the conditional probability distribution over the hidden
variables obeys the Markov property:
Thus the conditional probability of yi for each xi , given the model parameters and
the set of constraints, depends only on the cluster labels of the observed variables
that are must-linked or cannot-linked to xi . Then, by the Hammersley-Clifford
theorem (Hammersley and Clifford, 1971), the prior probability of a particular
label configuration Y can be expressed as a Gibbs distribution (Geman and Geman,
1984), so that
$$
P(Y \mid \Theta, C) = \frac{1}{Z} \exp\bigl(-v(Y)\bigr) = \frac{1}{Z} \exp\Bigl(-\sum_{N_i \in N} v_{N_i}(Y)\Bigr), \tag{5.2}
$$
where each constraint potential function v(i, j) has the following form:
$$
v(i, j) =
\begin{cases}
w_{ij}\, f_{ML}(i, j) & \text{if } c_{ij} = 1 \text{ and } y_i \neq y_j, \\
w_{ij}\, f_{CL}(i, j) & \text{if } c_{ij} = -1 \text{ and } y_i = y_j, \\
0 & \text{otherwise.}
\end{cases}
\tag{5.4}
$$
The penalty functions fML and fCL encode the lowered probability of observing
configurations of Y where constraints encoded by C are violated. To this end,
function fML penalizes violated must-link constraints and function fCL penalizes
violated cannot-link constraints. These functions are chosen to correspond with the
distortion measure by employing the same model parameters Θ, and will be described
in detail in section 5.3. Overall, this formulation for observing the label assignment
Y results in higher probabilities being assigned to configurations in which cluster
assignments do not violate the provided constraints.
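The constraint potential and the resulting energy of a label configuration can be sketched as follows. The penalty functions are passed in as callables because their concrete form is only defined later in the chapter; the names `constraint_potential` and `label_energy`, and the constant penalties used in the usage note, are illustrative assumptions.

```python
def constraint_potential(c_ij, y_i, y_j, w_ij, f_ml, f_cl):
    """v(i, j): a weighted penalty if the constraint is violated, 0 otherwise.

    f_ml and f_cl are the (already evaluated) penalty values for this pair.
    """
    if c_ij == 1 and y_i != y_j:      # violated must-link
        return w_ij * f_ml
    if c_ij == -1 and y_i == y_j:     # violated cannot-link
        return w_ij * f_cl
    return 0.0

def label_energy(labels, constraints, weights, f_ml, f_cl):
    """Sum of v(i, j) over constrained pairs; exp(-energy) / Z is the Gibbs prior.

    constraints: dict (i, j) -> +1 / -1; weights: dict (i, j) -> w_ij
    f_ml, f_cl: callables taking (i, j) and returning a non-negative penalty
    """
    return sum(
        constraint_potential(c, labels[i], labels[j], weights[(i, j)],
                             f_ml(i, j), f_cl(i, j))
        for (i, j), c in constraints.items()
    )
```

With unit penalties, for example `f_ml = f_cl = lambda i, j: 1.0`, the energy is simply the total weight of the violated constraints, so configurations that respect every constraint have energy zero and the highest prior probability.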
The joint probability of X, Y , and Θ, given C, in the described HMRF model can
be factorized as follows:
The graphical plate model (Buntine, 1994) of the dependence between the random
variables in the HMRF is shown in figure 5.2, where the unshaded nodes represent
the hidden variables, the shaded nodes are the observed variables, the directed links
show dependencies between the variables, while the lack of an edge between two
$$
P(X \mid Y, \Theta, C) = P(X \mid Y, \Theta) = \prod_{i=1}^{n} p(x_i \mid y_i, \Theta), \tag{5.6}
$$
The joint probability in Eq. (5.7) has three factors. The first factor describes a
probability distribution over the model parameters that prevents them from
converging to degenerate values, thereby providing regularization. The second factor
is the conditional probability of observing a particular label configuration given the
provided constraints, effectively assigning a higher probability to configurations where
the cluster assignments do not violate the constraints. Finally, the third factor is
the conditional probability of generating the observed data points given the labels
and the parameters: if maximum-likelihood (ML) estimation were performed on the
HMRF, the goal would be to maximize this term in isolation.
Overall, maximizing the joint HMRF probability in (5.7) is equivalent to jointly
maximizing the likelihood of generating data points from the model and the
probability of label assignments that respect the constraints, while regularizing
the model parameters.
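The three factors described above correspond to a chain-rule decomposition of the joint probability. As a sketch only (the exact typeset form of Eq. 5.7 may differ, and the ordering of the factors is taken from the description in the text):

```latex
\[
P(X, Y, \Theta \mid C)
  = \underbrace{P(\Theta \mid C)}_{\text{prior on parameters}}
    \;\underbrace{P(Y \mid \Theta, C)}_{\text{constraint-aware label prior}}
    \;\underbrace{P(X \mid Y, \Theta, C)}_{\text{data likelihood}} .
\]
```

The second factor is the Gibbs distribution of Eq. 5.2 and the third is the product of per-point densities in Eq. 5.6, so taking the negative logarithm turns the product into a sum of a distortion term, a constraint-penalty term, and a regularization term.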
corresponding to the yi th cluster is μyi , the mean of the points of that cluster.
Using this assumption and the bijection between regular exponential distributions
and regular Bregman divergences (Banerjee et al., 2005b), the conditional density
for observed data can be represented as
$$
p(x_i \mid y_i, \Theta) = \frac{1}{Z_\Theta} \exp\bigl(-d_A(x_i, \mu_{y_i})\bigr), \tag{5.8}
$$
where dA (xi , μyi ) is the Bregman divergence between xi and μyi , corresponding to
the exponential density p, and ZΘ is the normalizer.2 Different clustering models
fall into this exponential form:
If xi and μyi are vectors in Euclidean space, and dA is the square of the L2
distance parameterized by a positive semidefinite weight matrix A,
$d_A(x_i, \mu_{y_i}) = \|x_i - \mu_{y_i}\|_A^2$, then the cluster conditional probability
is a Gaussian with covariance encoded by A−1 (Kearns et al., 1997);
If xi and μyi are probability distributions and dA is the KL divergence,
$d_A(x_i, \mu_{y_i}) = \sum_{m=1}^{d} x_{im} \log \frac{x_{im}}{\mu_{y_i m}}$, then the cluster
conditional probability is a multinomial distribution (Dhillon and Guan, 2003).
The relation in Eq. 5.8 holds even if dA is not a Bregman divergence but
a directional distance measure like cosine distance. For example, if xi and μyi
are vectors of unit length and dA is one minus the dot product of the vectors,
$d_A(x_i, \mu_{y_i}) = 1 - \frac{\sum_{m=1}^{d} x_{im}\, \mu_{y_i m}}{\|x_i\|\, \|\mu_{y_i}\|}$,
then the cluster conditional probability is a
von Mises-Fisher (vMF) distribution with unit concentration parameter (Banerjee
et al., 2005a), which is essentially the spherical analog of a Gaussian. The connection
between specific distortion measures studied in this chapter and their corresponding
cluster conditional probabilities is discussed in more detail in section 5.3.3.
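The three distortion measures named above can be written down in a few lines. The sketch below uses plain Python lists; the function names are illustrative assumptions, A is a full matrix given as a list of rows, and the KL divergence assumes the second argument is strictly positive wherever the first is.

```python
import math

def sq_euclidean_A(x, mu, A):
    """Squared L2 distance weighted by matrix A: (x - mu)^T A (x - mu)."""
    diff = [xm - mm for xm, mm in zip(x, mu)]
    d = len(diff)
    return sum(diff[r] * A[r][c] * diff[c] for r in range(d) for c in range(d))

def kl_divergence(x, mu):
    """KL divergence between two discrete distributions (0 log 0 taken as 0)."""
    return sum(xm * math.log(xm / mm) for xm, mm in zip(x, mu) if xm > 0)

def cosine_distance(x, mu):
    """One minus the normalized dot product of two vectors."""
    dot = sum(xm * mm for xm, mm in zip(x, mu))
    norm_x = math.sqrt(sum(xm * xm for xm in x))
    norm_mu = math.sqrt(sum(mm * mm for mm in mu))
    return 1.0 - dot / (norm_x * norm_mu)
```

With A set to the identity matrix, `sq_euclidean_A` reduces to the ordinary squared Euclidean distance used by standard K-Means.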
Putting Eq. 5.8 into 5.7 and taking logarithms gives the following cluster objective
function, minimizing which is equivalent to maximizing the joint probability over
the HMRF in Eq. 5.7:
$$
J_{obj} = \sum_{x_i \in X} d_A(x_i, \mu_{y_i}) + \sum_{c_{ij} \in C} v(i, j) - \log P(\Theta) + \log Z + n \log Z_\Theta. \tag{5.9}
$$
Thus, the task is to minimize Jobj over the hidden variables Y and Θ (note that
given Y , the means M = (μ1 , . . . , μK ) are uniquely determined).
2. When A = I (identity matrix), the bijection result (Banerjee et al., 2005b) ensures that
the normalizer ZΘ is 1. In general, there are additional multiplicative terms that depend
only on x, and hence can be safely ignored for parameter estimation purposes.
5.3 HMRF-KMeans Algorithm
Since the cluster assignments and the generative model parameters are unknown in
a clustering setting, minimizing Eq. 5.9 is an “incomplete-data problem”. A popular
solution technique for such problems is the expectation-maximization (EM) algo-
rithm (Dempster et al., 1977). The K-Means algorithm (MacQueen, 1967) is known
to be equivalent to the EM algorithm with hard clustering assignments, under cer-
tain assumptions (Kearns et al., 1997; Basu et al., 2002; Banerjee et al., 2005b). This
section describes a K-Means-type hard partitional clustering algorithm, HMRF-
KMeans, that finds a local minimum of the semi-supervised clustering objective
function Jobj in Eq. 5.9.
Following the definition of Θ in section 5.2.1, the prior term log P(Θ) in (5.9) and
the subsequent equations can be factored as follows:
$$
P(A) = \prod_{a_{ij} \in A} \frac{a_{ij} \exp\left(-\frac{a_{ij}^2}{2s^2}\right)}{s^2}, \tag{5.10}
$$
For many realistic data sets, off-the-shelf distortion measures may fail to capture
the correct notion of similarity in a clustering setting. While some unsupervised
measures like squared Euclidean distance and Pearson’s distance attempt to correct
distortion estimates using the global mean and variance of the data set, these mea-
sures may still fail to estimate distances accurately if the attributes’ true contribu-
tions to the distance are not correlated with their variance. Several semi-supervised
clustering approaches exist that incorporate adaptive distortion measures, including
parameterizations of Jensen-Shannon divergence (Cohn et al., 2003) and squared
Euclidean distance (Bar-Hillel et al., 2003; Xing et al., 2003). However, these tech-
niques use only constraints to learn the distortion measure parameters and exclude
unlabeled data from the parameter learning step, as well as separate the parameter
learning step from the clustering process.
Going a step further, the HMRF model provides an integrated framework
which incorporates both learning the distortion measure parameters and constraint-
sensitive cluster assignments. In HMRF-KMeans, the parameters of the distortion
measure are learned iteratively as the clustering progresses, utilizing both unlabeled
data and pairwise constraints. The parameters are modified to decrease the
parameterized distance between points in violated must-link constraints and increase
it between points in violated cannot-link constraints, while allowing constraint
violations if they accompany a more cohesive clustering.
This section presents three examples of distortion functions and their parame-
terizations for use with HMRF-KMeans: squared Euclidean distance, cosine dis-
tance, and KL divergence. Through parameterization, each of these functions be-
comes adaptive in a semi-supervised clustering setting, permitting clusters of vary-
ing shapes.
Once a distortion measure is chosen for a given domain, the functions fML
and fCL , introduced in section 5.2.2 for penalizing must-link and cannot-link
constraint violations, respectively, must be defined. These functions typically follow
a functional form identical or similar to the corresponding distortion measure, and
are chosen as follows:
attempt to mend the violations. The ϕ function for different clustering distortion
measures is discussed in the following sections.
Accordingly, the potential function v(i, j) in (5.4) becomes
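$$
v(i, j) =
\begin{cases}
w_{ij}\, \varphi(x_i, x_j) & \text{if } c_{ij} = 1 \text{ and } y_i \neq y_j, \\
w_{ij} \bigl(\varphi^{\max} - \varphi(x_i, x_j)\bigr) & \text{if } c_{ij} = -1 \text{ and } y_i = y_j, \\
0 & \text{otherwise,}
\end{cases}
\tag{5.13}
$$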
and the objective function for semi-supervised clustering in (5.9) can be expressed
as
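$$
\begin{aligned}
J_{obj} = {} & \sum_{x_i \in X} d_A(x_i, \mu_{(i)})
  + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, \varphi(x_i, x_j) \\
& + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \bigl(\varphi^{\max} - \varphi(x_i, x_j)\bigr)
  - \log P(A) + n \log Z_\Theta.
\end{aligned}
\tag{5.14}
$$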
Note that as discussed in section 5.3.1, the MRF partition function term log Z has
been dropped from the objective function.
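$$
\begin{aligned}
J_{euc_A} = {} & \sum_{x_i \in X} d_{euc_A}(x_i, \mu_{(i)})
  + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, d_{euc_A}(x_i, x_j) \\
& + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \bigl(\varphi^{\max}_{euc_A} - d_{euc_A}(x_i, x_j)\bigr)
  - \log P(A) - n \log \det(A).
\end{aligned}
\tag{5.16}
$$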
Note that as discussed in section 5.3.1, the log ZΘ term is computable in closed-
form for a Gaussian distribution with covariance matrix A−1 , which is the underly-
ing cluster conditional probability distribution for parameterized squared Euclidean
distance. The log det(A) term in (5.16) corresponds to the log ZΘ term in this case.
$$
d_{\cos_A}(x_i, x_j) = 1 - \frac{x_i^T A\, x_j}{\|x_i\|_A\, \|x_j\|_A}. \tag{5.17}
$$
Because for realistic high-dimensional domains computing the full matrix A would
be computationally expensive, a diagonal matrix is considered in this case, such
that a = diag(A) is a vector of positive weights.
To use parameterized cosine distance as the adaptive distortion
measure for clustering, the ϕ function is defined as ϕ(xi , xj ) = dcosA (xi , xj ). Using
this definition along with Eq. 5.14, and setting ϕmax = 1 as an upper bound on
ϕ(xi , xj ), the following objective function is obtained for semi-supervised clustering
with adaptive cosine distance:
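$$
\begin{aligned}
J_{\cos_A} = {} & \sum_{x_i \in X} d_{\cos_A}(x_i, \mu_{(i)})
  + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, d_{\cos_A}(x_i, x_j) \\
& + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \bigl(1 - d_{\cos_A}(x_i, x_j)\bigr)
  - \log P(A).
\end{aligned}
\tag{5.18}
$$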
Note that as discussed in section 5.3.1, it is difficult to compute the log ZΘ term
in closed form for parameterized cosine distance. So, the simplifying assumption is
made that log ZΘ is constant during the clustering process and the normalizer term
is dropped from (5.18).
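For the diagonal case described above, the parameterized cosine distance of Eq. 5.17 reduces to weighted sums over the attributes. A minimal sketch, assuming the weights are passed as a list `a` of positive numbers (the function name is illustrative):

```python
import math

def cos_distance_A(x, y, a):
    """Parameterized cosine distance with diagonal A = diag(a).

    Computes 1 - (x^T A y) / (||x||_A ||y||_A), where ||x||_A = sqrt(x^T A x).
    """
    dot = sum(am * xm * ym for am, xm, ym in zip(a, x, y))
    norm_x = math.sqrt(sum(am * xm * xm for am, xm in zip(a, x)))
    norm_y = math.sqrt(sum(am * ym * ym for am, ym in zip(a, y)))
    return 1.0 - dot / (norm_x * norm_y)
```

Changing the weights changes which attributes dominate the angle: the vectors (1, 1) and (1, −1) are orthogonal under unit weights (distance 1), but with a = (4, 1) the first attribute dominates and the distance drops to 0.4.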
86 Probabilistic Semi-Supervised Clustering with Constraints
$$
d_{I_A}(x_i, x_j) = \sum_{m=1}^{d} a_m\, x_{im} \log \frac{x_{im}}{x_{jm}} - \sum_{m=1}^{d} a_m (x_{im} - x_{jm}), \tag{5.19}
$$
$$
d_{IM_A}(x_i, x_j) = \sum_{m=1}^{d} a_m \left( x_{im} \log \frac{2 x_{im}}{x_{im} + x_{jm}} + x_{jm} \log \frac{2 x_{jm}}{x_{im} + x_{jm}} \right). \tag{5.20}
$$
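$$
\begin{aligned}
J_{I_A} = {} & \sum_{x_i \in X} d_{I_A}(x_i, \mu_{(i)})
  + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, d_{IM_A}(x_i, x_j) \\
& + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \bigl(d^{\max}_{IM_A} - d_{IM_A}(x_i, x_j)\bigr)
  - \log P(A).
\end{aligned}
\tag{5.21}
$$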
The upper bound $d^{\max}_{IM_A}$ can be initialized as $d^{\max}_{IM_A} = \sum_{m=1}^{d} a_m$, which follows
from the fact that unweighted Jensen-Shannon divergence is bounded above by
1 (Lin, 1991).
Note that as discussed in section 5.3.1, it is difficult to compute the log ZΘ term in
closed form for parameterized KL distance. So, analogously to the parameterized
cosine distance case, the simplifying assumption is made that log ZΘ is constant
during the clustering process and that term is dropped from Eq. 5.21.
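The two weighted divergences in Eqs. 5.19 and 5.20 can be sketched as follows. The function names are illustrative assumptions, natural logarithms are used (the base is not fixed here), terms of the form 0 log 0 are taken as 0, and `d_IA` assumes the second argument is strictly positive wherever the first is.

```python
import math

def _xlog_ratio(p, q):
    """p * log(p / q), with the convention 0 * log(0 / q) = 0."""
    return 0.0 if p == 0 else p * math.log(p / q)

def d_IA(x, y, a):
    """Weighted I-divergence: sum a_m x_m log(x_m / y_m) - sum a_m (x_m - y_m)."""
    log_part = sum(am * _xlog_ratio(xm, ym) for am, xm, ym in zip(a, x, y))
    linear_part = sum(am * (xm - ym) for am, xm, ym in zip(a, x, y))
    return log_part - linear_part

def d_IMA(x, y, a):
    """Weighted I-divergence to the mean; symmetric in x and y."""
    total = 0.0
    for am, xm, ym in zip(a, x, y):
        s = xm + ym
        if s > 0:
            total += am * (_xlog_ratio(xm, s / 2) + _xlog_ratio(ym, s / 2))
    return total
```

Unlike `d_IA`, `d_IMA` stays finite when the two distributions have disjoint support, which is what makes it usable as a penalty between arbitrary pairs of points.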
5.3.4 EM Framework
5.3.5 Initialization
Good initial centroids are essential for the success of partitional clustering algo-
rithms such as K-Means. Good centroids are inferred from both the constraints
and unlabeled data during initialization. For this, a two-stage initialization process
is used.
Cluster Selection. The λ neighborhood sets produced in the first stage are used
to initialize the HMRF-KMeans algorithm. If λ = K, λ cluster centers are initialized
with the centroids of all the λ neighborhood sets. If λ < K, λ clusters are initialized
from the neighborhoods, and the remaining K −λ clusters are initialized with points
obtained by random perturbations of the global centroid of X. If λ > K, a weighted
variant of farthest-first traversal (Hochbaum and Shmoys, 1985) is applied to the
centroids of the λ neighborhoods, where the weight of each centroid is proportional
to the size of the corresponding neighborhood. Weighted farthest-first traversal
selects neighborhoods that are relatively far apart as well as large in size, and
the chosen neighborhoods are set as the K initial cluster centroids for HMRF-
KMeans.
Overall, this two-stage initialization procedure is able to take into account both
unlabeled and labeled data to obtain cluster representatives that provide a good
initial partitioning of the data set.
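The λ > K case can be sketched as a greedy selection. The text does not spell out how weight and distance are combined, so the scoring rule below (neighborhood weight times distance to the nearest already-chosen centroid, starting from the heaviest neighborhood) is an assumption made for illustration, as are the function names.

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two points given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_farthest_first(centroids, weights, k, dist=euclidean):
    """Pick k neighborhood centroids that are both large and far apart.

    Assumed rule: start from the heaviest neighborhood, then repeatedly add
    the candidate maximizing weight * (distance to nearest chosen centroid).
    Returns the indices of the chosen neighborhoods.
    """
    chosen = [max(range(len(centroids)), key=lambda i: weights[i])]
    while len(chosen) < k:
        remaining = [i for i in range(len(centroids)) if i not in chosen]

        def score(i):
            nearest = min(dist(centroids[i], centroids[c]) for c in chosen)
            return weights[i] * nearest

        chosen.append(max(remaining, key=score))
    return chosen
```

With neighborhood centroids at 0, 1, 10, and 11 on a line and sizes 5, 1, 1, and 4, asking for two clusters selects the neighborhoods at 0 and 11: the small neighborhood at 10 is nearly as far away, but its low weight rules it out.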
5.3.6 E Step
In the E step, assignments of data points to clusters are updated using the current
estimates of the cluster representatives. In the general unsupervised K-Means
algorithm, there is no interaction between the cluster labels, and the E step is
a simple assignment of every point to the cluster representative that is nearest to
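$$
\begin{aligned}
J_{obj}(x_i, \mu_h) = {} & d_A(x_i, \mu_h)
  + \sum_{\substack{(x_i, x_j) \in C^i_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, \varphi(x_i, x_j) \\
& + \sum_{\substack{(x_i, x_j) \in C^i_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \bigl(\varphi^{\max} - \varphi(x_i, x_j)\bigr)
  - \log P(A),
\end{aligned}
\tag{5.22}
$$
where $C^i_{ML}$ and $C^i_{CL}$ are the subsets of CML and CCL respectively in which xi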
appears in the constraints. The optimal assignment for every point minimizes the
distortion between the point and its cluster representative (first term of Jobj ) along
with incurring a minimal penalty for constraint violations caused by this assignment
(second and third terms of Jobj ). After all points are assigned, they are randomly
reordered, and the assignment process is repeated. This process proceeds until no
point changes its cluster assignment between two successive iterations.
Overall, the assignment of points to clusters incorporates pairwise supervision by
discouraging constraint violations proportionally to their severity, which guides the
algorithm toward a desirable partitioning of the data.
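The assignment procedure described above (visit points in random order, give each point the cluster with the lowest local cost, repeat until nothing changes) can be sketched as follows. This is an illustration under simplifying assumptions: a single weight `w` is shared by all constraints, the distortion `dist` and penalty `phi` are passed in as functions, the −log P(A) term is omitted because it does not depend on the assignment, and `max_sweeps` is a safety cap that is not part of the description.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance, used here as both distortion and penalty."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def e_step(X, centroids, labels, must_link, cannot_link, w,
           dist, phi, phi_max, rng, max_sweeps=100):
    """Greedy reassignment of points; returns the updated list of labels."""
    labels = list(labels)

    def local_cost(i, h):
        # Distortion to centroid h plus penalties for constraints on point i
        # that assigning i to cluster h would violate.
        cost = dist(X[i], centroids[h])
        for a, b in must_link:
            if i in (a, b):
                j = b if a == i else a
                if labels[j] != h:
                    cost += w * phi(X[i], X[j])
        for a, b in cannot_link:
            if i in (a, b):
                j = b if a == i else a
                if labels[j] == h:
                    cost += w * (phi_max - phi(X[i], X[j]))
        return cost

    for _ in range(max_sweeps):
        changed = False
        order = list(range(len(X)))
        rng.shuffle(order)                 # random reordering on each sweep
        for i in order:
            best_h, best_cost = labels[i], local_cost(i, labels[i])
            for h in range(len(centroids)):
                c = local_cost(i, h)
                if c < best_cost:          # move only on strict improvement
                    best_h, best_cost = h, c
            if best_h != labels[i]:
                labels[i] = best_h
                changed = True
        if not changed:                    # no point changed its cluster
            break
    return labels
```

Because a point moves only when its local cost strictly decreases, every move lowers the total objective, so the sweeps stop after finitely many changes.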
5.3.7 M Step
The M step of the algorithm consists of two parts: centroid re-estimation and
distortion measure parameter update.
In the first part of the M step, the cluster centroids M are re-estimated from points
currently assigned to them, to decrease the objective function Jobj in Eq. 5.9. For
Bregman divergences and cosine distance, the cluster representative calculated in
the M step of the EM algorithm is equivalent to the expectation value over the
points in that cluster, which is equal to their arithmetic mean (Banerjee et al.,
2005a,b). Additionally, it has been experimentally demonstrated that for clustering
with distribution-based measures, e.g., KL divergence, smoothing cluster represen-
tatives by a prior using a deterministic annealing schedule leads to considerable
improvements (Dhillon and Guan, 2003). With smoothing controlled by a positive
parameter α, each cluster representative μh is estimated as follows when dIA is the
distortion measure:
$$
\mu_h^{(I_A)} = \frac{1}{1 + \alpha} \left( \frac{\sum_{x_i \in X_h} x_i}{|X_h|} + \alpha\, \frac{1}{n}\, \mathbf{1} \right). \tag{5.23}
$$
For directional measures, each cluster representative is the arithmetic mean
projected onto unit sphere (Banerjee et al., 2005a). Taking the distortion parameters
into account, centroids are estimated as follows when dcosA is the distortion measure:
$$
\frac{\mu_h^{(\cos_A)}}{\|\mu_h^{(\cos_A)}\|_A} = \frac{\sum_{x_i \in X_h} x_i}{\bigl\|\sum_{x_i \in X_h} x_i\bigr\|_A}. \tag{5.24}
$$
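Both centroid updates are short enough to sketch directly. The function names are illustrative assumptions; `n` is passed explicitly because it enters Eq. 5.23 only through the constant vector added for smoothing, and the directional update assumes a diagonal A given as a list of positive weights `a`.

```python
import math

def smoothed_centroid(points, alpha, n):
    """Eq. 5.23: cluster mean mixed with the constant vector (1/n) * 1."""
    d = len(points[0])
    mean = [sum(p[m] for p in points) / len(points) for m in range(d)]
    return [(mean[m] + alpha / n) / (1 + alpha) for m in range(d)]

def projected_centroid(points, a):
    """Eq. 5.24: sum of the points, rescaled to unit length in the A-norm."""
    d = len(points[0])
    total = [sum(p[m] for p in points) for m in range(d)]
    norm_a = math.sqrt(sum(a[m] * total[m] ** 2 for m in range(d)))
    return [t / norm_a for t in total]
```

Setting `alpha = 0` recovers the plain arithmetic mean, and larger values pull the representative toward the constant vector, which keeps every coordinate strictly positive for KL-type divergences.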
In the second part of the M step, the parameters of the parameterized distortion
measure are updated to decrease the objective function. In general, for parameter-
ized Bregman divergences or directional distances with general parameter priors,
it is difficult to attain a closed-form update for the parameters of the distortion
measure that can minimize the objective function.4 Gradient descent provides an
alternative avenue for learning the distortion measure parameters.
For squared Euclidean distance, a full parameter matrix A is updated during
gradient descent using the rule $A = A + \eta \frac{\partial J_{euc_A}}{\partial A}$ (where η is the learning rate).
Using (5.16), $\frac{\partial J_{euc_A}}{\partial A}$ can be expressed as
4. For the specific case of parameterized squared Euclidean distance, a closed-form update
of the parameters can be obtained (Bilenko et al., 2004).
$$
\frac{\partial d_{euc_A}(x_i, x_j)}{\partial A} = (x_i - x_j)(x_i - x_j)^T.
$$
The derivative of the upper bound $\varphi^{\max}_{euc_A}$ is
$\frac{\partial \varphi^{\max}_{euc_A}}{\partial A} = \sum_{(x_i, x_j) \in C_{CL}} (x_i - x_j)(x_i - x_j)^T$
if $\varphi^{\max}_{euc_A}$ is computed as described in section 5.3.3.1 (footnote 5).
When Rayleigh priors are used on the set of parameters A, the partial derivative
of the log-prior with respect to every individual parameter am ∈ A,
$\frac{\partial \log P(A)}{\partial a_m}$, is given by
$$
\frac{\partial \log P(A)}{\partial a_m} = \frac{1}{a_m} - \frac{a_m}{s^2}. \tag{5.26}
$$
The gradient of the distortion normalizer log det(A) term is as follows:
$$
\frac{\partial \log \det(A)}{\partial A} = 2A^{-1} - \mathrm{diag}(A^{-1}). \tag{5.27}
$$
For parameterized cosine distance and KL divergence, a diagonal parameter
matrix A is considered, where a = diag(A) is a vector of positive weights. During
gradient descent, each weight am is individually updated as
$a_m = a_m + \eta \frac{\partial J_{obj}}{\partial a_m}$ (η is the learning rate).
Using (5.14), $\frac{\partial J_{obj}}{\partial a_m}$ can be expressed as
Calculation of the gradient $\frac{\partial J_{obj}}{\partial a_m}$ for cosine distance and KL divergence, which
are parameterized by a diagonal matrix A, needs the gradients of the corresponding
distortion measures and constraint potential functions, which are
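\[
\sum_{x_i \in X} d_A(x_i, \mu_{(i)})
+ \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, \varphi(x_i, x_j)
+ \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \left( \varphi^{max} - \varphi(x_i, x_j) \right)
- \log P(A).
\]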
Given a set of centroids and distortion parameters, the new cluster assignment of
points will decrease Jobj or keep it unchanged.
For analyzing the centroid re-estimation step, let us consider an equivalent form
of Eq. 5.14:
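\[
J_{obj} = \sum_{h=1}^{K} \sum_{x_i \in X_h} d_A(x_i, \mu_h)
+ \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \neq y_j}} w_{ij}\, \varphi(x_i, x_j)
+ \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \left( \varphi^{max} - \varphi(x_i, x_j) \right)
- \log P(A). \tag{5.30}
\]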
Each cluster centroid μh is re-estimated by taking the mean of the points in the
partition Xh , which minimizes the component $\sum_{x_i \in X_h} d_A(x_i, \mu_h)$ of Jobj in Eq. 5.30
contributed by the partition Xh . The constraint potential and the prior term in the
objective function play no part in centroid re-estimation, because they are
not explicit functions of the centroid. So, given the cluster assignments and the
distortion parameters, Jobj will decrease or remain the same in this step.
For the parameter estimation step, the gradient-descent update of the parameters
in M step (B) decreases Jobj or keeps it unchanged. Hence the objective function
decreases after every cluster assignment, centroid re-estimation, and parameter
re-estimation step. Now, note that the objective function is bounded below by
a constant: being the negative log likelihood of a probabilistic model with the
normalizer terms, Jobj is bounded below by zero. Even without the normalizers,
the objective function is bounded below by zero, since the distortion and potential
terms are non-negative due to the fact that A is positive definite. Since Jobj is
bounded below, and HMRF-KMeans results in a decreasing sequence of objective
function values, the value sequence must have a limit. The limit in this case will
be a fixed point of Jobj since neither updating the assignments nor the parameters
can further decrease the value of the objective function. As a result, the HMRF-
KMeans algorithm will converge to a fixed point of the objective. In practice,
convergence can be determined if subsequent iterations of HMRF-KMeans result
in insignificant changes in Jobj .
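The stopping rule can be illustrated with a deliberately simplified stand-in: plain K-Means with squared Euclidean distance (A fixed to the identity, no constraints, no prior), which is not HMRF-KMeans itself but shares the alternating structure. The function name, tolerance, and data below are assumptions made for this sketch.

```python
def kmeans_objective_trace(points, centroids, tol=1e-9, max_iter=100):
    """Plain K-Means that records the objective J after every iteration."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    trace, labels = [], []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(len(centroids)), key=lambda h: sqdist(p, centroids[h]))
                  for p in points]
        # Centroid step: mean of each partition (an empty partition keeps its centroid).
        updated = []
        for h, old in enumerate(centroids):
            members = [p for p, y in zip(points, labels) if y == h]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)
        centroids = updated
        J = sum(sqdist(p, centroids[y]) for p, y in zip(points, labels))
        converged = bool(trace) and abs(trace[-1] - J) < tol
        trace.append(J)
        if converged:
            # The objective changed insignificantly between iterations.
            break
    return centroids, labels, trace

# Illustrative data: two well-separated groups and a poor initialization.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centroids, labels, trace = kmeans_objective_trace(points, [(0, 0), (1, 0)])
```

The recorded `trace` never increases, and the loop ends once two successive objective values differ by less than `tol`, mirroring the practical convergence test described above.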
5.4 Active Learning for Constraint Acquisition
In the semi-supervised setting where training data are not already available, getting
constraints on pairs of data points may be expensive. In this section an active
learning scheme for the HMRF model is presented, which can improve clustering
performance with as few queries as possible. Formally, the scheme has access to a
(noiseless) oracle that can assign a must-link or cannot-link label to a given pair
(xi , xj ), and it can pose a constant number of queries to the oracle.6
6. The oracle can also give a don’t-know response to a query, in which case that response
is ignored (pair not considered as a constraint) and that query is not posed again later.
In order to get pairwise constraints that are more informative than random in
the HMRF model, an active learning scheme for selecting pairwise constraints using
the farthest-first traversal scheme is developed. In farthest-first traversal, a starting
point is first selected at random. Then, the next point farthest from it is chosen and
added to the traversed set. After that, the next point farthest from the traversed
set (using the standard notion of distance from a set: $d(x, S) = \min_{x' \in S} d(x, x')$)
is selected, and so on. Farthest-first traversal gives an efficient approximation of
the K-center problem (Hochbaum and Shmoys, 1985), and has also been used to
construct hierarchical clusterings with performance guarantees at each level of the
hierarchy (Dasgupta, 2002).
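Farthest-first traversal as described above can be sketched in a few lines. The function name, the `seed` argument, and the choice to return point indices are assumptions of this sketch; `dist` is any user-supplied distance function.

```python
import random

def farthest_first_traversal(points, k, dist, seed=0):
    """Return the indices of up to k points chosen by farthest-first traversal."""
    rng = random.Random(seed)
    start = rng.randrange(len(points))          # starting point selected at random
    traversed = [start]
    # min_dist[i] holds d(points[i], traversed set) = min over traversed points.
    min_dist = [dist(p, points[start]) for p in points]
    while len(traversed) < min(k, len(points)):
        candidates = [i for i in range(len(points)) if i not in traversed]
        nxt = max(candidates, key=lambda i: min_dist[i])   # farthest from the set
        traversed.append(nxt)
        min_dist = [min(m, dist(p, points[nxt])) for m, p in zip(min_dist, points)]
    return traversed

# Illustrative 1-D data: whatever the random start, the outlier at 10.0 or the
# point at 0.0 is picked second because it is farthest from the start.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
chosen = farthest_first_traversal(points, 2, lambda p, q: abs(p[0] - q[0]))
```

Keeping the running minimum distance to the traversed set means each new selection costs one pass over the data rather than a pass per traversed point.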
Basu et al. (2002) observed that initializing K-Means with centroids estimated
from a set of labeled examples for each cluster gives significant performance
improvements. Under certain generative model-based assumptions, one can
connect the mixture of Gaussians model to K-Means with squared Euclidean distance
(Kearns et al., 1997). A direct calculation using Chernoff bounds shows that
if a particular cluster with an underlying Gaussian model is seeded with points
drawn independently at random from the corresponding Gaussian distribution, the
deviation of the centroid estimates falls exponentially with the number of seeds;
hence seeding results in good initial centroids. Since good initial centroids are
critical for the success of greedy algorithms such as K-Means, the same principle
is followed for the pairwise case: the goal is to get as many points as possible per
cluster (proportional to the actual cluster size) by asking pairwise queries, so that
HMRF-KMeans is initialized from a very good set of centroids. The proposed
active learning scheme has two phases, Explore and Consolidate, which are
discussed next.
5.4.1 Exploration
The Explore phase explores the given data using farthest-first traversal to get K
pairwise disjoint non-null neighborhoods as fast as possible, with each neighborhood
belonging to a different cluster in the underlying clustering of the data. Note that
even if there is only one point per neighborhood, this neighborhood structure
defines a correct skeleton of the underlying clustering. Our algorithm Explore
(algorithm 5.2) uses farthest-first traversal for getting a skeleton structure of the
neighborhoods, and terminates when it has run out of queries, or when at least
one point from all the clusters has been labeled. In the latter case, active learning
enters the consolidation phase.
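One possible reading of the Explore phase is sketched below. It is not a transcription of algorithm 5.2: the function signature, the simulated oracle (a callable that returns True for must-link), and the choice to query one representative per neighborhood are assumptions made for illustration.

```python
import random

def explore(points, k, oracle, max_queries, dist, seed=0):
    """Build up to k disjoint neighborhoods using farthest-first selection."""
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    neighborhoods = [[first]]
    labeled = {first}
    queries = 0
    while len(neighborhoods) < k and len(labeled) < len(points):
        # Farthest-first: unlabeled point with the largest distance to the labeled set.
        x = max((i for i in range(len(points)) if i not in labeled),
                key=lambda i: min(dist(points[i], points[j]) for j in labeled))
        for nb in neighborhoods:
            if queries >= max_queries:
                return neighborhoods, queries     # ran out of queries
            queries += 1
            if oracle(x, nb[0]):                  # must-link: x joins this neighborhood
                nb.append(x)
                break
        else:
            # Cannot-linked to every existing neighborhood: x starts a new one.
            neighborhoods.append([x])
        labeled.add(x)
    return neighborhoods, queries

# Illustrative data: three 1-D clusters and a noiseless oracle built from true labels.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.2]
truth = [0, 0, 0, 1, 1, 2, 2]
neighborhoods, used = explore(points, 3, lambda i, j: truth[i] == truth[j],
                              max_queries=20, dist=lambda a, b: abs(a - b))
```

On this data the sketch stops as soon as each of the three clusters has a labeled representative, which is the condition under which the consolidation phase would take over.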
5.4.2 Consolidation
The basic idea in Consolidate (algorithm 5.3) is as follows: since there is at least
one labeled point from all the clusters, the proper neighborhood of any unlabeled
point x can be determined within a maximum of (K − 1) queries. The queries will
be formed by taking a point y from each of the neighborhoods in turn and asking
for the label on the pair (x, y) until a must-link is obtained. Either a must-link reply
When the right number of clusters K is not known to the clustering algorithm,
K is also unknown to the active learning scheme. In this case, only Explore is
used while queries are allowed. Explore will keep discovering new clusters as fast
as it can. When it has obtained all the clusters, it will not have any way of knowing
this. However, from this point onward, for every farthest-first x it draws from the
data set, it will always find a neighborhood that is must-linked to it. Hence, after
discovering all of the clusters, Explore will essentially consolidate the clusters too.
However, when K is known, it makes sense to invoke Consolidate since (1) it adds
points to clusters at a faster rate than Explore, and (2) it picks random samples
following the underlying data distribution, which is advantageous for estimating
good centroids (e.g., Chernoff bounds on the centroid estimates exist), while samples
obtained using farthest-first traversal may not have such properties.
Figure 5.3 Clustering results for $D_{\cos_a}$ on News-Different-3 data set (NMI versus number of constraints; curves: KMeans-C-D-R, KMeans-C-D, KMeans-C, KMeans).
Figure 5.4 Clustering results for $D_{I_a}$ on News-Different-3 data set (NMI versus number of constraints; same four curves).
7. http://www.vivisimo.com
Figure 5.5 Clustering results for $D_{\cos_a}$ on News-Related-3 data set (NMI versus number of constraints; curves: KMeans-C-D-R, KMeans-C-D, KMeans-C, KMeans).
Figure 5.6 Clustering results for $D_{I_a}$ on News-Related-3 data set (NMI versus number of constraints; same four curves).
Figure 5.7 Clustering results for $D_{\cos_a}$ on News-Similar-3 data set (NMI versus number of constraints; same four curves).
Figure 5.8 Clustering results for $D_{I_a}$ on News-Similar-3 data set (NMI versus number of constraints; same four curves).
small number of all the possible words. On such data sets, clustering algorithms
can easily get stuck in local optima: in such cases it has been observed that there is
little relocation of documents between clusters for most initializations, which leads
to poor clustering quality after convergence of the algorithm (Dhillon and Guan,
2003). Supervision in the form of pairwise constraints is most beneficial in such
cases and may significantly improve clustering quality.
We derived three data sets from the 20-Newsgroups collection.8 This collection
has messages harvested from 20 different Usenet newsgroups, 1000 messages from
each newsgroup. From the original data set, a reduced data set was created by tak-
ing a random subsample of 100 documents from each of the 20 newsgroups. Three
data sets were created by selecting three categories from the reduced collection.
News-Similar-3 consists of three newsgroups on similar topics (comp.graphics,
comp.os.ms-windows, comp.windows.x) with significant overlap between clusters
due to cross-posting. News-Related-3 consists of three newsgroups on related top-
ics (talk.politics.misc, talk.politics.guns, and talk.politics.mideast).
News-Different-3 consists of articles posted in three newsgroups that cover different
topics (alt.atheism, rec.sport.baseball, sci.space) with well-separated
clusters. The vector-space model of News-Similar-3 has 300 points in 1864 dimensions,
News-Related-3 has 300 points in 3225 dimensions, and News-Different-3 has
300 points in 3251 dimensions. Since the overlap between topics in News-Similar-3
and News-Related-3 is significant, they are more challenging data sets than
News-Different-3.
8. http://www.ai.mit.edu/people/jrennie/20Newsgroups
All the data sets were preprocessed by stopword removal, TF-IDF weighting,
removal of very high-frequency and low-frequency words, etc., following the
methodology of Dhillon and Modha (2001).
5.5.3 Methodology
cross-validation with a holdout set. The clustering algorithm was run on the whole
data set, but NMI was calculated only on the test set. The learning curve results
were averaged over the 20 runs.
proposed directly injecting the constraints into the affinity matrix before subse-
quent clustering, while De Bie et al. (2004) reformulated the optimization problem
corresponding to spectral clustering by incorporating a separate label constraint
matrix. Additionally, spectral clustering methods can be viewed as variants of the
graph-cut approaches to clustering (Shi and Malik, 2000), a connection that
motivated the correlation clustering method proposed by Bansal et al. (2002), where
the constraints correspond to edge labels between vertices representing data points.
Another family of semi-supervised clustering methods has focused on modifying
the distance function employed by the clustering algorithm. In early work, Cohn
et al. (2003) proposed using a weighted variant of Jensen-Shannon divergence within
the EM clustering algorithm, with the weights learned using gradient descent based
on constraint violations. Within the family of hierarchical agglomerative clustering
algorithms, Klein et al. (2002) proposed modifying the squared Euclidean distance
using the shortest-path algorithm. Several researchers have proposed methods for
learning the parameters of the weighted Mahalanobis distance, a generalization of
Euclidean distance, within the context of semi-supervised clustering. Xing et al.
(2003) utilized convex optimization and iterative projections to learn the weight
matrix of Mahalanobis distance within K-Means clustering. Another approach
focused on parameterized Mahalanobis distance is the relevant component analysis
(RCA) algorithm proposed by Bar-Hillel et al. (2003), where convex optimization
is also used to learn the weight matrix.
Learning distance metrics within semi-supervised clustering relates to a large set
of approaches for transforming the data representation to make it more suitable
to a particular learning task. Within this book, part IV (chapters 15–17) describes
several advanced techniques for changing the geometry of the data space to obtain
better estimates of similarity between data points; integrating these methods with
clustering algorithms provides a number of promising avenues for future work.
5.7 Conclusions
presented for acquiring supervision from a user in the form of effective pairwise
constraints for semi-supervised clustering – such an active learning algorithm would
be useful in an interactive query-driven clustering framework.
The HMRF model can be viewed as a unification of constraint-based and
distance-based semi-supervised clustering approaches. It can be expanded to a
more general setting where every cluster has a corresponding distinct distortion
measure (Bilenko et al., 2004), leading to a clustering algorithm that can identify
clusters of different shapes. Empirical evaluation of the framework described in
this chapter can be found in several previous publications: active learning experiments
are discussed in Basu et al. (2004a), while Bilenko et al. (2004) and Basu
et al. (2004b) contain results for low-dimensional and high-dimensional data sets,
respectively, and Bilenko and Basu (2004) compare several approximate inference
methods for the E step discussed in section 5.3.6.
An important practical issue in using generative models for semi-supervised
learning is model selection. For semi-supervised clustering with constraints, the
key model selection issue is one of choosing the right number of clusters. One can
consider using a traditional model selection criterion suitable for the supervised
setting, or perform model selection by cross-validation. An alternative is to perform
model selection using bounds on the test-set error rate such that valuable supervised
data are saved for learning. The PAC-MDL bounds (Blum and Langford, 2003)
provide such a tool that has been successfully applied to model selection for
clustering (Banerjee et al., 2005a), and can be readily extended to the semi-
supervised clustering setting. In fact, the semi-supervised clustering setting is
more natural since PAC-MDL bounds are applicable for transductive learning.
Alternative methods of model selection are a good topic for future research.
II Low-Density Separation
6 Transductive Support Vector Machines
6.1 Introduction
the documents in the database). Second, the test examples are known a priori and
can be observed by the learning algorithm during training. This allows the learning
algorithm to exploit any information that might be contained in the location of
the test examples. Transductive learning is therefore a particular case of semi-
supervised learning, since it allows the learning algorithm to exploit the unlabeled
examples in the test set. The following focuses on this second point, while chapter 24
elaborates on the first point.
More formally, the transductive learning setting can be formalized as follows.1
Given is a set
that enumerates all n possible examples. In our relevance feedback example from
above, there would be one index i for each document in the collection. We assume
that each example i is represented by a feature vector xi ∈ Rd . For text documents,
this could be a TFIDF vector representation (see e.g. (Joachims, 2002)), where
each document is represented by a scaled and normalized histogram of the words
it contains. The collection of feature vectors for all examples in S is denoted as
When training a transductive learning algorithm L, it not only has access to the
training vectors Xtrain and the training labels Ytrain ,
\[
X_{train} = (x_{l_1}, x_{l_2}, \ldots, x_{l_l}), \qquad Y_{train} = (y_{l_1}, y_{l_2}, \ldots, y_{l_l}), \tag{6.5}
\]
The transductive learner uses Xtrain , Ytrain , and Xtest (but not the labels Ytest of
1. While several other, more general, definitions of transductive learning exist (Vapnik,
1998; Joachims, 2002; Derbeko et al., 2003), this one was chosen for the sake of simplicity.
for the labels of the test examples. The learner’s goal is to minimize the fraction of
erroneous predictions,
\[
Err_{test}(Y^*_{test}) = \frac{1}{u} \sum_{i \in S_{test}} \delta_{0/1}(y_i^*, y_i), \tag{6.8}
\]
The structure should reflect prior knowledge about the learning task. In particular,
the structure should be constructed so that, with high probability, the correct
labeling of S (or labelings that make few errors) is contained in an element Hi
of small cardinality. This structuring of the hypothesis space H can be motivated
using generalization error bounds from statistical learning theory. In particular, for
a learner L that searches for a hypothesis $(Y^*_{train}, Y^*_{test}) \in H_i$ with small training
error,
\[
Err_{train}(Y^*_{train}) = \frac{1}{l} \sum_{i \in S_{train}} \delta_{0/1}(y_i^*, y_i), \tag{6.10}
\]
it is possible to upper-bound the fraction of test errors $Err_{test}(Y^*_{test})$ (Vapnik, 1998;
Derbeko et al., 2003). With probability 1 − η,
\[
Err_{test}(Y^*_{test}) \le Err_{train}(Y^*_{train}) + \Omega(l, u, |H_i|, \eta), \tag{6.11}
\]
where the confidence interval Ω(l, u, |Hi |, η) depends on the number of training
examples l, the number of test examples u, and the cardinality |Hi | of Hi (see
(Vapnik, 1998) for details). The smaller the cardinality |Hi |, the smaller is the
confidence interval Ω(l, u, |Hi |, η) on the deviation between training and test error.
The bound indicates that a good structure ensures accurate prediction of the
test labels. And here lies a crucial difference between transductive and inductive
learners. Unlike in the inductive setting, we can study the location Xtest of the test
Figure 6.1 The two graphs illustrate the labelings that margin hyperplanes can realize
depending on the margin size. Example points are indicated as dots; the margin of each
hyperplane is illustrated by the gray area. The left graph shows the separators $H_\rho$ for a
small margin threshold ρ. The number of possible labelings $N_\rho$ decreases as the margin
threshold is increased, as in the graph on the right.
6.2 Transductive Support Vector Machines
Transductive support vector machines (TSVMs) assume a particular geometric
relationship between X = (x1 , ..., xn ) and P (y1 , ..., yn ). They build a structure
on H based on the margin of hyperplanes {x : w · x + b = 0} on the complete
sample X = (x1 , x2 , ..., xn ), including both the training and the test vectors. The
margin of a hyperplane on X is the minimum distance to the closest example vectors
in X:
\[
\min_{i \in [1..n]} \frac{y_i}{\|w\|} (w \cdot x_i + b) \tag{6.12}
\]
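The margin in (6.12) is straightforward to compute for a given hyperplane and labeling. The sketch below uses illustrative names and toy data that are not from the chapter; it compares a labeling that follows the gap between two groups of points with one that cuts through a group.

```python
import math

def margin(w, b, X, y):
    """Margin of the hyperplane {x : w.x + b = 0} on X under labeling y, as in (6.12)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, yi in zip(X, y))

# Toy sample: two groups on a line, around 0..1 and 5..6.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]

# Labeling that follows the gap between the groups, separated at x = 3.
between_groups = margin((1.0, 0.0), -3.0, X, [-1, -1, 1, 1])

# Labeling that cuts through the first group, separated at x = 0.5.
through_group = margin((1.0, 0.0), -0.5, X, [-1, 1, 1, 1])
```

The first labeling attains a margin of 2, the second only 0.5, so a structure ordered by margin size would reach the first labeling in a smaller structure element.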
The structure element Hρ contains all labelings of X which can be achieved with
hyperplane classifiers h(x) = sign{x · w + b} that have a margin of at least ρ
on X. The dependence of Hρ on ρ is illustrated in figure 6.1. Intuitively, building
the structure based on the margin gives preference to labelings that follow cluster
boundaries over labelings that cut through clusters. Vapnik shows that the size of
the margin ρ can be used to control the cardinality of the corresponding set of
Figure 6.2 For the same data as in figure 6.1, some examples are now labeled. Posi-
tive/negative examples are marked as +/−. The dashed line is the solution of an inductive
SVM, which finds the hyperplane that separates the training data with largest margin, but
ignores the test vectors. The solid line shows the hard-margin transductive classification,
which is the labeling that has zero training error and the largest margin with respect to
both the training and the test vectors. The TSVM solution aligns the labeling with the
cluster structure in the training and test vectors.
Solving this problem means finding the labeling $y^*_{u_1}, \ldots, y^*_{u_k}$ of the test data for
which the hyperplane that separates both training and test data has maximum
margin. Figure 6.2 illustrates this. The figure also shows the solution that an
inductive SVM (Cortes and Vapnik, 1995; Vapnik, 1998) computes. An inductive
SVM also finds a large-margin hyperplane, but it considers only the training
vectors while ignoring all test vectors. In particular, a hard-margin inductive SVM
computes the separating hyperplane that has zero training error and the largest
margin with respect to the training examples.
To be able to handle nonseparable data, one can introduce slack variables ξi
(Joachims, 1999) similar to inductive SVMs (Cortes and Vapnik, 1995).
OP2 (Transductive SVM (soft-margin))
\[
\min: \; W(y^*_{u_1}, \ldots, y^*_{u_u}, w, b, \xi_1, \ldots, \xi_l, \xi^*_1, \ldots, \xi^*_u) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} \xi_i + C^* \sum_{j=1}^{u} \xi^*_j \tag{6.19}
\]
C and C ∗ are parameters set by the user. They allow trading off margin size
against misclassifying training examples or excluding test examples. C ∗ can be
used to reduce sensitivity toward outliers (i.e., single examples falsely reducing the
margin on the test data).
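To make the roles of C and C ∗ concrete, the objective (6.19) can be evaluated for a fixed hyperplane and a candidate labeling of the test examples. The sketch assumes the usual soft-margin constraints, under which the smallest feasible slack of an example (x, y) is max(0, 1 − y(w · x + b)); the function name and toy data are illustrative.

```python
def tsvm_objective(w, b, X_train, y_train, X_test, y_test_guess, C, C_star):
    """Value of (6.19) for fixed (w, b) and a candidate test labeling."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def slack(x, y):
        # Assumed constraint form: y (w.x + b) >= 1 - slack, slack >= 0.
        return max(0.0, 1.0 - y * (dot(w, x) + b))

    train_slack = sum(slack(x, y) for x, y in zip(X_train, y_train))
    test_slack = sum(slack(x, y) for x, y in zip(X_test, y_test_guess))
    return 0.5 * dot(w, w) + C * train_slack + C_star * test_slack

# Toy 1-D problem: the test point at 0.5 lies inside the margin.
w, b = (1.0,), 0.0
X_train, y_train = [(2.0,), (-2.0,)], [1, -1]
X_test = [(0.5,), (-3.0,)]

obj_a = tsvm_objective(w, b, X_train, y_train, X_test, [1, -1], C=1.0, C_star=0.1)
obj_b = tsvm_objective(w, b, X_train, y_train, X_test, [-1, -1], C=1.0, C_star=0.1)
```

With a small C ∗ the test slacks contribute little, so a single test example near the hyperplane changes the objective only slightly whichever label it receives; raising C ∗ makes the two labelings differ more sharply.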
Both inductive and transductive SVMs can be extended to include kernels (Boser
et al., 1992; Vapnik, 1998). Making use of duality techniques from optimization
theory, kernels allow learning nonlinear rules as well as classification rules over
nonvectorial data (see e.g. (Schölkopf and Smola, 2002)) without substantially
changing the optimization problems.
Note that in both the hard-margin formulation (OP1) and the soft-margin formu-
lation (OP2) of the TSVM, the labels of the test examples enter as integer variables.
Due to the constraints in Eqs. 6.18 and 6.22 respectively, both OP1 and OP2 are
no longer convex quadratic programs like the analogous optimization problems for
inductive SVMs. Before discussing methods for (approximately) solving the TSVM
optimization problems, let’s first discuss some intuition about why structuring the
hypothesis space based on the margin on the test examples might be reasonable.
[Figure 6.3: word-occurrence table for documents D1 to D6.]
6.3 Why Use Margin on the Test Set?
Why should it be reasonable to prefer a labeling with a large margin over a labeling
with a smaller margin, even if both have the same training error? Clearly, this
question can only be addressed in the context of a particular learning problem. In
the following, we will consider text classification as an example. In particular, for
topic-based text classification it is known that good classification rules typically
have a large margin (Joachims, 2002). The following example gives some intuition
for why this is the case.
In the field of information retrieval it is well known that words in natural language
occur in co-occurrence patterns (see e.g. (van Rijsbergen, 1977)). Some words are
likely to occur together in one document; others are not. For example, when
asking Google about all documents containing the words pepper and salt, it
returns 3,500,000 webpages. When asking for the documents with the words pepper
and physics, we get only 248,000 hits, although physics (162,000,000 hits) is a
more popular word on the web than salt (63,200,000 hits). Many approaches in
information retrieval try to exploit this cluster structure of text (see e.g. (Baeza-
Yates and Ribeiro-Neto, 1999, chapter 5)). It is this co-occurrence information that
TSVMs exploit as prior knowledge about the learning task.
Consider the example in figure 6.3. Imagine document D1 was given as a training
example for class A and document D6 was given as a training example for class
B. How should we classify documents D2 to D5 (the test set)? Even if we did
not understand the meaning of the words, we would classify D2 and D3 into class
A, and D4 and D5 into class B. We would do so even though D1 and D3 do
not share any informative words. The reason we choose this classification of the
test data over the others stems from our prior knowledge about the properties of
text and common text-classification tasks. Often we want to classify documents by
topic, source, or style. For these types of classification tasks we find stronger
co-occurrence patterns within classes than between different classes. In our example
we analyzed the co-occurrence information in the test data and found two clusters.
These clusters indicate different topics of {D1, D2, D3} versus {D4, D5, D6}, and
we choose the cluster separator as our classification. Note again that we got to this
classification by studying the location of the test examples, which is not possible
for an inductive learner.
Figure 6.4 Macro-averaged PRBEP on the Reuters data set for different training set
sizes and a test set size of 3299 (curves: Transductive SVM, SVM, Naive Bayes; horizontal axis: number of examples in training set, 17 to 9603).
The TSVM outputs the same classification as we suggested above, although all
16 labelings of D2 to D5 can be achieved with linear separators. Assigning D2
and D3 to class A and D4 and D5 to class B is the maximum-margin solution
(i.e., the solution of OP1). The maximum-margin bias appears to reflect our prior
knowledge about text classification well. By measuring margin on the test set, the
TSVM exploits co-occurrence patterns that indicate boundaries between topics.
6.4 Experiments and Applications of TSVMs
Structuring the hypothesis space using margin was obviously beneficial in the toy
example above. Experiments have confirmed that this also holds in practice.
Figures 6.4 and 6.5 (from Joachims (1999)) give empirical evidence that
TSVMs improve prediction performance on real text-classification tasks, namely
the Reuters-21578 text-classification benchmark. The standard “ModApte” train-
ing/test split is used, leading to a corpus of 9603 training documents and 3299
test documents. The results are averaged over the ten most frequent topics, while
keeping all documents. Each topic leads to a binary classification problem, where
documents about the topic are positive examples, and all other documents are neg-
where the algorithm can ask for the labels of particular examples, dominates the
improvement seen from the TSVM. This is in contrast to the findings of Wang et al.
(2003). They find that incorporating TSVMs into their active learning procedure
for image retrieval based on relevance feedback substantially improves performance.
For more text-classification experiments see chapter 3.
Beyond text classification, Bennett and Demiriz (1999) have applied their L1-norm
variant of transductive SVMs to several UCI benchmark problems. They find
small but fairly consistent improvements over these tasks. A key difference from
most other experiments with transductive learning is the small size of the test sets that
were used. Due to efficiency limitations of the mixed-integer programming code
they used for training, all test sets contained no more than 70 examples. Their
evaluation of regular TSVMs on a subset of these UCI benchmarks shows mixed
results (Demiriz and Bennett, 2000). Similar findings on UCI benchmarks are also
reported by Joachims (2003), where the differences between inductive SVMs and
TSVMs were found to be small.
Several applications of TSVMs in bioinformatics have been explored. For example,
they have been used to recognize promoter sequences in genes. Kasabov and
Pang (2004) report that TSVMs substantially outperform inductive SVMs in their
experiments. However, for the problem of predicting the functional properties of
proteins, Krogel and Scheffer (2004) find that TSVMs significantly decrease performance
compared to inductive SVMs.
Goutte et al. (2002) apply TSVMs to a problem of recognizing entities (e.g., gene
names, protein names) in medical text. They find that TSVMs substantially improve
performance for medium-sized training sets, and perform at least comparably
to an alternative transductive learning method based on Fisher kernels.
Summarizing the results, it appears that TSVMs are particularly well suited for
text classification and several other (typically high-dimensional) learning problems.
However, on some problems the TSVM performs roughly equivalently to an induc-
tive SVM, or sometimes even worse. This is to be expected, since it is likely that
structuring the hypothesis space according to margin size is inappropriate for some
applications. Furthermore, it is likely that the difficulty of finding the optimum of
the TSVM optimization problem has led to suboptimal results in some cases. We
discuss algorithms for solving the TSVM optimization problem next.
6.5 Solving the TSVM Optimization Problem
Both the hard-margin and soft-margin TSVM optimization problems can be written as
mixed-integer problems with a quadratic objective and linear constraints. Unfortunately,
currently no algorithm is known to efficiently find a globally optimal solution.
Vapnik and colleagues (Vapnik and Sterin, 1977; Wapnik and Tscherwonenkis,
1979) proposed the use of branch-and-bound search to find the global optimum of
the TSVM optimization problem. Similarly, Bennett and Demiriz (1999) consider
standard mixed-integer programming software like CPLEX to solve a variant of
the TSVM optimization problem. To be able to use such software, they replace
the term $w \cdot w = \|w\|_2^2$ in the objective with $\|w\|_1$ so that the objective becomes
linear. However, while both approaches produce globally optimal solutions, they
can solve only small problems with fewer than 100 test examples in reasonable time.
Unfortunately, figure 6.5 suggests that the biggest benefits of transductive learning
occur only for larger test sets.
The algorithm implemented in SVMlight does not necessarily produce a globally
optimal solution, but can handle test sets with up to 100,000 examples in reasonable
time (Joachims, 1999, 2002). Most of the empirical results in the previous section
were produced using this algorithm. The algorithm performs a kind of coordinate-
descent local search starting from an initial labeling of the test examples derived
from an inductive SVM. The ratio of test examples that are classified as positive
(by adjusting the hyperplane threshold b) in this initial labeling is specified by the
user or estimated from the ratio of positive to negative examples in the training set.
This ratio is maintained throughout the optimization process to avoid degenerate
solutions that assign all test examples to the same class.2 In every step of the
local search, the algorithm selects two examples (one positive and one negative)
and swaps their labels. The way the examples are selected guarantees a strict
improvement of the objective function (i.e., the soft margin) in every such step.
In addition, the algorithm starts with a small value of C ∗ and raises it throughout
the optimization process. This means that most ξ ∗ are non-zero in the initial phase
of the search, resulting in a smoother objective function. Toward the end of the
search, incrementally increasing the value of C ∗ toward the desired target value
makes the problem closer to the desired objective. A more detailed explanation of
the algorithm is given in (Joachims, 2002).
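The label-swapping step can be sketched as a search for a pair of test examples whose labels are worth exchanging. The selection rule used below (opposite current labels, both slacks positive, slacks summing to more than 2) is the condition as recalled from Joachims (1999) and should be treated as an assumption here, as should the function name and the sample values; it is not SVMlight code.

```python
def find_swap_pair(test_labels, test_slacks):
    """Return indices (m, l) of one positive and one negative test example whose
    labels could be swapped, or None if no pair meets the assumed condition."""
    n = len(test_labels)
    for m in range(n):
        for l in range(m + 1, n):
            opposite = test_labels[m] * test_labels[l] < 0
            both_violate = test_slacks[m] > 0 and test_slacks[l] > 0
            if opposite and both_violate and test_slacks[m] + test_slacks[l] > 2:
                return m, l
    return None

# Illustrative slack values for the current labeling of three test examples.
pair = find_swap_pair([1, -1, 1], [1.5, 0.8, 0.0])
no_pair = find_swap_pair([1, -1], [0.9, 0.9])
```

Swapping one positive and one negative label at a time leaves the fraction of positively labeled test examples unchanged, which is how the ratio fixed at initialization is preserved during the search.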
A related block coordinate descent method was proposed by Demiriz and Bennett
(2000). The algorithm also alternates between changing the labels of the test exam-
ples and recomputing the margin. Differences compared to the SVMlight algorithm
lie in the selection of the labels to change, the number of labels that are changed
in each iteration, and in the heuristics that are aimed to avoid local optima. A
similar algorithm for the L1 -norm variant of the TSVM is described by Fung and
Mangasarian (2001).
De Bie and Cristianini (2004a) explore a convex approximation of the TSVM
optimization problem (also see chapter 7). They present a relaxation that takes the
form of a semi-definite program. While this program can be solved in polynomial
time, it becomes too inefficient for test sets with more than 100 examples. However,
assuming a low-rank structure of the test labels derived from a spectral decomposition
technique, De Bie and Cristianini push the efficiency limit to several thousand
test examples.
2. In text classification, assigning all test examples to the same class typically gives larger
margins than any other labeling. Clearly, this is an undesirable solution and indicates
a problem with the TSVM approach. A method that does not exhibit this problem is
presented in Joachims (2003).
116 Transductive Support Vector Machines
The difficulty in solving the TSVM optimization problem has led to much interest
in other formulations of transductive learning algorithms. The goal is to exploit the
same type of relationship between the geometry of the test examples (or unlabeled
examples more generally) and their labels, but in formulations that have computationally
more convenient properties. Graph partitioning approaches based on st-min-cuts (Blum
and Chawla, 2001) and spectral graph partitioning explicitly or implicitly pursue
this goal (Belkin and Niyogi, 2002; Chapelle et al., 2003; Joachims, 2003; Zhu
et al., 2003b) (see also chapters 11, 12, 13, 14, and 15). For example, the method
in (Joachims, 2003) is explicitly derived, in analogy to a TSVM, as a transductive
version of the k-nearest neighbor classifier.
Ridge regression is a method closely related to regression SVMs. Chapelle et al.
(1999) derive a transductive variant of ridge regression. Since the labels do
not need to be discrete for regression problems, they show that the solution of the
associated optimization problem can be computed efficiently.
Co-training (Blum and Mitchell, 1998) exploits two redundant representations of
a learning problem for semi-supervised learning. A connection to general trans-
ductive learning comes from the insight that co-training produces transductive
learning problems that have large margin (Joachims, 2003, 2002). In fact, TSVMs
and spectral partitioning methods appear to perform well on co-training problems
(Joachims, 2003).
Connecting to concepts of algorithmic randomness, Gammerman et al. (1998),
Vovk et al. (1999), and Saunders et al. (1999) presented approaches to estimating
the confidence of a prediction based on a transductive setting. A similar goal using
a Bayesian approach is pursued by Graepel et al. (2000). Since their primary aim
is not a reduced error rate in general, but a measure of confidence for a particular
prediction, they consider only test sets with exactly one example.
Finding the globally optimal solution is intractable for interestingly sized test sets.
Existing algorithms resort to local search or to relaxing the optimization problem.
More work is needed on tractable formulations and algorithms for transductive
learning, as well as a deeper theoretical and empirical understanding of its potential.
7 Semi-Supervised Learning Using Semi-
Definite Programming
In transduction problems, we are provided with a set of labeled data points (training
set), as well as a set of unlabeled data points (test set). Our interest is to find
suitable labels for the second set, with no immediate ambition to make predictions
for yet unseen data points that may become available later on. The way the SVM
transduction problems handle this is by finding those test set labels for which, after
training an SVM on the combined training and test set, the margin on the full data
set is maximal. This involves optimizing over all labelings of the test set, which is an integer
programming problem with exponential cost.
Let us recall the primal soft-margin SVM problem (see e.g. (Cristianini
and Shawe-Taylor, 2000) and (Shawe-Taylor and Cristianini, 2004) for an introduction
to SVMs and kernel methods):
\[
\begin{aligned}
\min_{\xi_i, w} \quad & \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \\
\text{s.t.} \quad & y_i w^T x_i \ge 1 - \xi_i \\
& \xi_i \ge 0.
\end{aligned}
\]
We omitted the bias term here, as we will do throughout the entire chapter. This is
not a problem, as argued in (Poggio et al., 2001). Only the labeled data points are
involved in this optimization problem. Then, the transductive SVM can be written
as
\[
\begin{aligned}
\min_{\xi_i, w, Y_u} \quad & \tfrac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & y_i w^T x_i \ge 1 - \xi_i \\
& \xi_i \ge 0 \\
& Y_u \in \{-1, 1\}^u, \qquad \text{(7.1)}
\end{aligned}
\]
where we used the notation Yu = (yl+1 , . . . yn ) for the set of test set labels, a column
vector containing the labels for the test points, and n = l + u for the total number
of training and test points. It is the combinatorial constraint 7.1 that makes this
optimization problem very hard to solve exactly.
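To make the exponential cost of the combinatorial constraint concrete, the following sketch solves the transductive problem exactly by enumeration on a toy problem. It assumes one-dimensional inputs and no bias term, so that w is a scalar and the inner SVM problem can be minimized by a simple ternary search; the data, the value C = 10, and all function names are illustrative choices, not part of the chapter.

```python
from itertools import product


def svm_objective(w, xs, ys, C):
    """Primal soft-margin objective without bias for scalar inputs."""
    hinge = sum(max(0.0, 1.0 - y * w * x) for x, y in zip(xs, ys))
    return 0.5 * w * w + C * hinge


def solve_svm_1d(xs, ys, C, lo=-10.0, hi=10.0, iters=200):
    """Minimize the strictly convex objective over scalar w by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if svm_objective(m1, xs, ys, C) < svm_objective(m2, xs, ys, C):
            hi = m2
        else:
            lo = m1
    w = 0.5 * (lo + hi)
    return svm_objective(w, xs, ys, C), w


def brute_force_tsvm(x_train, y_train, x_test, C=10.0):
    """Try all 2^u test labelings and keep the one with the lowest objective."""
    best = None
    tried = 0
    xs = x_train + x_test
    for y_test in product([-1, 1], repeat=len(x_test)):
        tried += 1
        value, w = solve_svm_1d(xs, y_train + list(y_test), C)
        if best is None or value < best[0]:
            best = (value, list(y_test), w)
    return best, tried


(best_value, best_labels, best_w), tried = brute_force_tsvm(
    x_train=[2.0, -2.0], y_train=[1, -1], x_test=[1.5, 2.5, -1.8, -2.2]
)
```

With u = 4 test points, 16 SVM problems are solved; every additional test point doubles this number, which is why exact enumeration is only feasible for very small test sets.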
All constraints are now linear (matrix) inequalities, and the objective is linear in Γ
and concave in α. However, the problem is still an integer program due to constraint
7.3 and hence the overall problem is not convex.
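We will write the label matrix Γ as a block matrix using the notation
\[
\Gamma = \begin{pmatrix} \Gamma_{ll} & \Gamma_{lu} \\ \Gamma_{ul} & \Gamma_{uu} \end{pmatrix}
       = \begin{pmatrix} Y_l Y_l^T & Y_l Y_u^T \\ Y_u Y_l^T & Y_u Y_u^T \end{pmatrix}.
\]
Symmetry constraints such as $\Gamma_{uu} = \Gamma_{uu}^T$ and $\Gamma_{lu} = \Gamma_{ul}^T$ are understood and we will
never mention them explicitly. Now, observe that any matrix of rank 1 with ones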
on the diagonal can be written as an outer product of a vector with itself where
this vector only contains 1 and −1 as its elements. Thus, the following proposition
holds:
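Proposition 7.1 We can reformulate the constraints (7.2) and (7.3) by the equivalent
set of constraints
\[
\begin{aligned}
\operatorname{diag}(\Gamma) &= 1 \\
\operatorname{rank}(\Gamma) &= 1 \\
\Gamma &= \begin{pmatrix} Y_l Y_l^T & \Gamma_{lu} \\ \Gamma_{ul} & \Gamma_{uu} \end{pmatrix}.
\end{aligned}
\]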
These constraints are linear in the parameters, except for the rank constraint, which
is clearly nonconvex (indeed, a convex combination of two matrices of rank 1 will
generally be of rank 2). To deal with this problem, in this chapter we propose to
relax the constraint set by extending the feasible region to a convex set over which
optimization can be accomplished in a reasonable computation time. To retain a
good performance, it should not be much larger than the nonconvex set specified
by the constraints above.
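The nonconvexity of the rank constraint can be checked on a small case. The sketch below, with arbitrarily chosen label vectors, builds two label matrices as outer products and verifies that their midpoint still has ones on the diagonal but no longer has rank 1. The rank test via 2 x 2 minors is a convenience for this illustration.

```python
def outer(y):
    """Label matrix Gamma = y y^T."""
    return [[a * b for b in y] for a in y]


def has_rank_one(gamma, tol=1e-12):
    """A nonzero matrix has rank 1 iff every 2x2 minor vanishes."""
    n = len(gamma)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                for m in range(k + 1, n):
                    minor = gamma[i][k] * gamma[j][m] - gamma[i][m] * gamma[j][k]
                    if abs(minor) > tol:
                        return False
    return True


y1 = [1, 1, -1]
y2 = [1, -1, -1]
g1, g2 = outer(y1), outer(y2)
# Midpoint of the two label matrices: a convex combination.
mid = [[0.5 * (a + b) for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]
diagonal = [mid[i][i] for i in range(len(mid))]
```

The midpoint satisfies the linear constraint on the diagonal, so it is exactly the kind of point that the relaxed feasible region admits and the original one excludes.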
Note that the constraints imply that the matrix Γ is positive semi-definite (PSD).
So, we can add Γ ⪰ 0 as an additional constraint without modifying the problem.
The relaxation then consists in simply dropping the rank constraint.1 The resulting
relaxed optimization problem is
Of course, the rank of the resulting optimal matrix Γ will not necessarily be equal
to 1 anymore, and its entries not equal to 1 and −1. However, we can see that each
entry of Γ will still lie in the interval [−1, 1]. Indeed, since all principal submatrices
of a PSD matrix have to be PSD as well, every 2 × 2 principal submatrix has to be
PSD, which for a matrix containing ones on its diagonal can only be achieved for
off-diagonal elements in [−1, 1]. Furthermore:
1. Ideally, we should relax the constraints so as to extend the feasible region to just the
convex hull of the constraints, which is the smallest convex set containing the feasible
region of the original problem. For a label matrix Y Y^T with Y ∈ {−1, 1}^n, this convex
hull is referred to as the cut polytope. However, no efficient description of the cut polytope is
known. Hence, one has to resort to convex relaxations of the cut polytope itself, such as the
elliptope, which is essentially the relaxation used in this chapter. Other relaxations of the
cut polytope are known (such as the metric polytope), and they can be used alternatively
or in addition. Tighter relaxations tend to be computationally more challenging, though,
and for brevity we will not consider these here. For more information we refer the reader
to (Helmberg, 2000; Anjos, 2001).
7.1 Relaxing SVM Transduction 123
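\[
\begin{aligned}
\min_{\Gamma} \quad & f(\Gamma) \\
\text{s.t.} \quad & \operatorname{diag}(\Gamma) = 1 \\
& \Gamma \succeq 0 \\
& \Gamma = \begin{pmatrix} Y_l Y_l^T & \Gamma_{lu} \\ \Gamma_{ul} & \Gamma_{uu} \end{pmatrix}.
\end{aligned}
\]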
Let us first concentrate on f (Γ). For a given Γ ⪰ 0, the objective is concave and
the constraints are all linear, i.e., we have a convex optimization problem. One can
easily verify Slater’s constraint qualification (the existence of a strictly feasible point
in the constraint set, see e.g. (Anjos, 2001)), showing that strong duality holds. Let
us now write the dual optimization problem by using Lagrange multipliers 2µ ≥ 0
and 2ν ≥ 0 for the inequality constraints C ≥ αi and αi ≥ 0 respectively (the
factor 2 in front of µ and ν is used for notational convenience). By invoking strong
duality, which states that the dual optimum is equal to the primal optimum, we
can now write f (Γ) as
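\[
\begin{aligned}
f(\Gamma) = \min_{\mu, \nu, t} \quad & t \\
\text{s.t.} \quad & \mu \ge 0 \\
& \nu \ge 0 \\
& t \ge (1 - \mu + \nu)^T (K \odot \Gamma)^{\dagger} (1 - \mu + \nu) + 2 C \mu^T 1.
\end{aligned}
\]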
Using the extended Schur complement lemma (see appendix), we can rewrite the
latter constraint as
\[
\begin{pmatrix} K \odot \Gamma & (1 - \mu + \nu) \\ (1 - \mu + \nu)^T & t - 2 C \mu^T 1 \end{pmatrix} \succeq 0,
\]
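\[
\begin{aligned}
\min_{\Gamma, \mu, \nu, t} \quad & t \qquad \text{(7.4)} \\
\text{s.t.} \quad & \mu \ge 0 \\
& \nu \ge 0 \\
& \operatorname{diag}(\Gamma) = 1 \\
& \Gamma \succeq 0 \\
& \begin{pmatrix} K \odot \Gamma & (1 - \mu + \nu) \\ (1 - \mu + \nu)^T & t - 2 C \mu^T 1 \end{pmatrix} \succeq 0 \\
& \Gamma = \begin{pmatrix} Y_l Y_l^T & \Gamma_{lu} \\ \Gamma_{ul} & \Gamma_{uu} \end{pmatrix}. \qquad \text{(7.5)}
\end{aligned}
\]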
This is a convex optimization problem that is solvable in polynomial time (see e.g.
(Nesterov and Nemirovsky, 1994; Vandenberghe and Boyd, 1996)).
Proof From the extended Schur complement lemma it follows that the column
space of $\Gamma_{lu}$ should be orthogonal to the null space of $Y_l Y_l^T$. This is only possible if
$\Gamma_{lu} = Y_l \gamma_u^T$ for some vector $\gamma_u$.
Proposition 7.4 The constraint Γ ⪰ 0 is equivalent with
$\begin{pmatrix} 1 & \gamma_u^T \\ \gamma_u & \Gamma_{uu} \end{pmatrix} \succeq 0$.
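Proof We use the fact that a principal submatrix of a PSD matrix is PSD as well
(Horn and Johnson, 1985). By taking a principal submatrix of Γ containing exactly
one row and the corresponding column among the first l, and all of the last u rows
and columns, we can see that Γ ⪰ 0 implies
$\begin{pmatrix} 1 & \gamma_u^T \\ \gamma_u & \Gamma_{uu} \end{pmatrix} \succeq 0$. On the other hand,
from $\begin{pmatrix} 1 & \gamma_u^T \\ \gamma_u & \Gamma_{uu} \end{pmatrix} \succeq 0$ we get
\[
\begin{pmatrix} Y_l & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} 1 & \gamma_u^T \\ \gamma_u & \Gamma_{uu} \end{pmatrix}
\begin{pmatrix} Y_l & 0 \\ 0 & I \end{pmatrix}^T
= \begin{pmatrix} Y_l Y_l^T & Y_l \gamma_u^T \\ \gamma_u Y_l^T & \Gamma_{uu} \end{pmatrix}
= \Gamma \succeq 0.
\]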
Thus, the final formulation of the relaxed SVM transduction problem is given by
Here we would like to point out that the equality constraint on the diagonal
can also be turned into an inequality constraint without affecting the solution:
diag (Γuu ) ≤ 1. Indeed, if the diagonal were lower than 1, we could simply increase
it without affecting the constraints or increasing the objective. We will use this fact
later in this chapter.
As noted earlier, the optimal value for Γ may have a rank different from 1. So it
does not provide us with a direct estimate for the label vector Yu . However, the
previous section gives us a hint of what a suitable estimate for it can be: it is given
by simply one of the columns of Γ corresponding to a positively labeled training
point. In other words, we propose to take the (thresholded) vector γu as an estimate
for the optimal test label vector.
Other approaches are possible, such as taking the dominant eigenvector of Γ,
or using a randomized approach. For more information on such methods, see
(Helmberg, 2000).
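Two of these rounding strategies can be sketched as follows. The matrix used here is a synthetic stand-in for a relaxed solution (a noisy version of a rank-1 label matrix, built so that it is PSD with unit diagonal), not the output of an actual SDP solver. The first point is assumed to be a positively labeled training point and the remaining three are test points. The power iteration is a generic way to approximate the dominant eigenvector and is our choice for this illustration.

```python
def threshold(values):
    """Round a real-valued vector to labels in {-1, +1}."""
    return [1 if v >= 0 else -1 for v in values]


def dominant_eigenvector(matrix, iters=500):
    """Power iteration for a symmetric PSD matrix."""
    n = len(matrix)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-degenerate start
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Synthetic relaxed solution: 0.8 * y y^T + 0.2 * I has unit diagonal,
# is PSD, and has full rank, so it is not itself a label matrix.
y_true = [1, 1, -1, 1]
gamma = [[0.8 * a * b + (0.2 if i == j else 0.0) for j, b in enumerate(y_true)]
         for i, a in enumerate(y_true)]

# Strategy 1: threshold the column of the positively labeled training point.
gamma_u = [row[0] for row in gamma[1:]]
labels_from_column = threshold(gamma_u)

# Strategy 2: threshold the dominant eigenvector, with its sign fixed so
# that the training point comes out positive.
v = dominant_eigenvector(gamma)
if v[0] < 0:
    v = [-x for x in v]
labels_from_eigenvector = threshold(v[1:])
```

On this easy example both strategies agree; on a genuinely high-rank solution they may differ.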
In most practical cases, the computational complexity of this relaxation is still too
high. In this section we present an approximation technique that will allow for a
considerable speedup of the method at the cost of a reasonable performance loss.
It is notable that this technique may have wider applicability to speed up convex
relaxations of combinatorial problems, such as for the max-cut problem (see e.g.
(Helmberg, 2000)).
Let us assume for a moment that we can come up with a d-dimensional subspace
of R^n that contains the optimal label vector Y. We represent this subspace by
the columns of the matrix V ∈ R^{n×d}, which form a basis for it. Then the optimal
label matrix Γ = Y Y^T can be represented as Γ = V M V^T, with M ∈ R^{d×d} a
symmetric matrix of rank 1. Our relaxation of the rank constraint on Γ to an
SDP constraint then translates into an analogous relaxation on M. The resulting
optimization problem can be obtained by simply replacing all occurrences of Γ
with V M V^T in Eqs. 7.4 and 7.5 and optimizing over M instead of over Γ:
7.2 An Approximation for Speedup 127
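\[
\begin{aligned}
\min_{M, \mu, \nu, t} \quad & t \\
\text{s.t.} \quad & \mu \ge 0 \\
& \nu \ge 0 \\
& \operatorname{diag}(V M V^T) = 1 \\
& M \succeq 0 \\
& \begin{pmatrix} K \odot (V M V^T) & (1 - \mu + \nu) \\ (1 - \mu + \nu)^T & t - 2 C \mu^T 1 \end{pmatrix} \succeq 0.
\end{aligned}
\]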
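The effect of the subspace parameterization can be seen on a toy case. In the sketch below, the basis V and the coefficient vector a are made up so that the label vector Y = V a lies in the span of V; the code then checks that Γ = V M V^T with M = a a^T reproduces Y Y^T, and counts the free entries of the symmetric matrices Γ and M.

```python
def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Hypothetical basis V (n = 4, d = 2) whose span contains Y = V a.
V = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0], [1.0, -1.0]]
a = [[0.0], [1.0]]
Y = matmul(V, a)                        # column vector (1, 1, -1, -1)^T
M = matmul(a, transpose(a))             # d x d, symmetric, rank 1
gamma = matmul(matmul(V, M), transpose(V))

n, d = len(V), len(V[0])
free_entries_gamma = n * (n + 1) // 2   # symmetric n x n matrix
free_entries_m = d * (d + 1) // 2       # symmetric d x d matrix
```

The number of optimization variables thus drops from order n^2 to order d^2, which is the source of the speedup.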
and the generalized eigenvectors belonging to the small eigenvalues capture the
cluster structure in the data (which means that a label vector corresponding to
a good clustering of the data is likely to be close to the space spanned by these
generalized eigenvectors).
In order to ensure that the given training label information is respected by this
solution, additional constraints should be imposed on v. This can be achieved
constructively by making use of what we call the label constraint matrix L, defined
by
\[
L = \begin{pmatrix} 1_{l+} & 1_{l+} & 0 \\ 1_{l-} & -1_{l-} & 0 \\ 1_u & 0 & I \end{pmatrix},
\]
where $1_{l+}$ and $1_{l-}$ are vectors containing as many ones as there are positively and,
respectively, negatively labeled training points, and $1_u$ contains u ones. Using L,
we can constrain v to respect the training label information by parameterizing it
and the corresponding constrained solution is v = Lz. For more details about this
method we encourage the reader to consult (De Bie et al., 2004).
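What the parameterization v = Lz enforces can be checked directly. The sketch below builds L for a small example, assuming the data points are ordered as positive training points, negative training points, and unlabeled points, and applies it to an arbitrary parameter vector z. The function names and the numbers are illustrative only.

```python
def label_constraint_matrix(n_pos, n_neg, n_unl):
    """Rows are ordered: positive training, negative training, unlabeled."""
    rows = []
    for _ in range(n_pos):
        rows.append([1.0, 1.0] + [0.0] * n_unl)
    for _ in range(n_neg):
        rows.append([1.0, -1.0] + [0.0] * n_unl)
    for i in range(n_unl):
        rows.append([1.0, 0.0] + [1.0 if j == i else 0.0 for j in range(n_unl)])
    return rows


def apply(L, z):
    """Matrix-vector product v = L z."""
    return [sum(l_ij * z_j for l_ij, z_j in zip(row, z)) for row in L]


L = label_constraint_matrix(n_pos=2, n_neg=1, n_unl=3)
z = [0.1, 0.5, 0.3, -0.4, 0.0]   # arbitrary parameter vector
v = apply(L, z)
```

All positively labeled points receive the same value z[0] + z[1], all negatively labeled points receive z[0] - z[1], and each unlabeled point keeps one free coordinate on top of the common offset z[0].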
A good subspace to which the label vector is likely to be close is then spanned
by the vectors vi = Lzi with zi the generalized eigenvectors of (7.6) corresponding
to the d smallest eigenvalues (except for the one equal to zero). Hence, we can
construct a good matrix V by stacking these vi next to each other.
The constraint on the diagonal diag (Γ) = 1 will in general be infeasible when
using the subspace trick. However, as noted above, we can turn it into an inequality
constraint diag (Γ) ≤ 1 without fundamentally changing the problem. In fact, if the
dimensionality d (i.e., the number of columns) of V were equal to u, there would be
no difference between the optimal solutions obtained with or without the subspace
approximation, as then the entire feasible region of Γ is the same. The diagonal
would then be equal to 1, even if only an inequality constraint is specified.
Thus far we have discussed the transductive setting, which is just one of the semi-
supervised learning tasks described in chapter 1. We will briefly point out, however,
how the technology in this chapter can straightforwardly be extended to deal with
more general settings.
As in (De Bie et al., 2004) and in chapter 5 of this book, we are able to handle
more general semi-supervised learning settings (see also (De Bie et al., 2003) and
(Shental et al., 2004) where similar constraints are exploited for doing dimensionality
reduction and in computing a Gaussian mixture model respectively). Imagine
the situation where we are given grouplets of points for which a label vector Yi is
specified. If we allow such grouplets to contain only one data point, we can assume
without loss of generality that each point belongs to exactly one grouplet. The label
vector Yi indicates which points within the grouplet are given to be in the same
class (an equivalence constraint), namely those with the same entry 1 or −1 in Yi ,
and which ones are given to belong to opposite classes (when their entry in Yi is
different, an inequivalence constraint). In between different grouplets no information
is given. This means that the overall sign of such a grouplet label vector Yi is
arbitrary.
Then, using similar techniques as we used above, one can show that the label
matrix Γ should be a block matrix, with the diagonal blocks equal to $Y_i Y_i^T$, and
the off-diagonal blocks (i, j) equal to $\gamma_{i,j} Y_i Y_j^T$:
\[
\Gamma = \begin{pmatrix}
Y_1 Y_1^T & \gamma_{1,2} Y_1 Y_2^T & \cdots & \gamma_{1,k} Y_1 Y_k^T \\
\gamma_{2,1} Y_2 Y_1^T & Y_2 Y_2^T & \cdots & \gamma_{2,k} Y_2 Y_k^T \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{k,1} Y_k Y_1^T & \gamma_{k,2} Y_k Y_2^T & \cdots & Y_k Y_k^T
\end{pmatrix},
\]
where $\gamma_{i,j} = \gamma_{j,i}$ are the variables over which we have to optimize. Clearly, the label
matrix as in the transduction scenario explained at the beginning of this chapter
is a special case thereof. Now we can also see that the sign of the label vectors $Y_i$
is irrelevant: upon changing the sign of $Y_i$, the optimal solution will simply change
accordingly by reversing the signs of $\gamma_{i,j}$ and $\gamma_{j,i}$ for all j.
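The block structure and the sign invariance can be verified numerically. In the following sketch the grouplet label vectors and the coupling values are arbitrary example numbers; the code assembles Γ block by block and checks that flipping the sign of one grouplet together with all of its couplings leaves Γ unchanged.

```python
def block_label_matrix(groups, gammas):
    """groups: list of grouplet label vectors Y_i; gammas[i][j]: coupling
    between grouplets i and j (the diagonal blocks use a factor of 1)."""
    rows = []
    for i, yi in enumerate(groups):
        for a in yi:
            row = []
            for j, yj in enumerate(groups):
                g = 1.0 if i == j else gammas[i][j]
                row.extend(g * a * b for b in yj)
            rows.append(row)
    return rows


groups = [[1, -1], [1], [1, 1]]
gammas = [[1.0, 0.3, -0.6], [0.3, 1.0, 0.2], [-0.6, 0.2, 1.0]]
gamma = block_label_matrix(groups, gammas)

# Flip the arbitrary sign of grouplet 0 and of all its couplings.
flipped_groups = [[-y for y in groups[0]]] + groups[1:]
flipped_gammas = [row[:] for row in gammas]
for j in range(1, len(groups)):
    flipped_gammas[0][j] *= -1
    flipped_gammas[j][0] *= -1
gamma_flipped = block_label_matrix(flipped_groups, flipped_gammas)
```

With k grouplets only k(k − 1)/2 scalars remain to be optimized, instead of one variable per pair of data points.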
We want to point out that this method makes it possible to tackle the transductive
SVM problem in a hierarchical way. First one can perform a crude clustering of the
data points into many small clusters (grouplets) that respect the training data.
Then, at a second stage, the semi-supervised SVM approach outlined above can be
employed. This may greatly reduce the computational cost of the overall algorithm.
Also here the subspace trick can be applied in a very analogous way. Again we can
rely on the method described in (De Bie et al., 2004), which is also able to deal
with equivalence and inequivalence constraints.
The kernel used in all experiments in this subsection is the radial basis function
(RBF) kernel, and the width is set to the average over all data points of the distance
to their closest neighbor. Figure 7.1 shows an artificially constructed example of a
transduction problem solved by the basic SDP relaxation of the transductive SVM.
Only two data points were labeled, one for each of the two classes. Clearly, a standard
inductive SVM would fail in this extreme case.
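The width heuristic described above can be sketched in a few lines. The exact RBF parameterization is not given in the text; the form exp(−‖x − y‖² / (2σ²)) used below is an assumption, as are the function names and the example points.

```python
import math


def nearest_neighbor_width(points):
    """Average, over all points, of the distance to the closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def rbf_kernel_matrix(points, width):
    # Assumed parameterization: k(x, y) = exp(-||x - y||^2 / (2 * width^2)).
    return [[math.exp(-math.dist(p, q) ** 2 / (2.0 * width ** 2)) for q in points]
            for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 2.0)]
width = nearest_neighbor_width(points)
K = rbf_kernel_matrix(points, width)
```

Tying the width to nearest-neighbor distances makes the kernel adapt to the scale of the data without a separate tuning step.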
Furthermore, the transductive optimum is so far from the inductive optimum
that a greedy strategy such as SVMlight is bound to get stuck in a local optimum.
Indeed, the norm of the SVM weight vector at the optimal labeling found by the
SDP relaxation is 5.7, and for the SVMlight local optimum it is 7.3. Thus, the
labeling found by the SDP relaxation achieves a larger margin.2 Furthermore, it is
notable that the optimum of the relaxed optimization problem is 35.318608 while
the (inductive) SVM optimum when using the predicted labels for the unlabeled
data points is only slightly larger: 35.318613. This indicates that most likely the
optimal labeling has been found, since the optimum of the unrelaxed transductive
SVM problem has to lie between these values (see section 7.1.4).
In figure 7.2 we show another artificial example, where the data seem to consist
of five clusters. We labeled six samples, at least one in each of the clusters. Both
the SDP relaxation and SVMlight clearly succeed in assigning the same label to
all data points that are within the same cluster, and consistent with the training
label in that cluster. Figure 7.3 shows the same data set with a different labeling
of the training points. The transductive optimum found by the SDP relaxation is
slightly imbalanced: 38 data points in one class, and 42 in the other. For this reason
SVMlight seems to classify two data points differently, as, by default, it tries to find
a solution with the same proportion of positively versus negatively labeled test
points as in the training set. The norm of the SVM weight vector for the optimal
labeling as found by the SDP relaxation is equal to 5.92, which is slightly smaller
than 5.96, the weight vector norm for the SVMlight solution. Hence also here the
SDP approach achieves a larger margin.
Again, in both cases the lower bound provided by the optimum of the SDP
relaxation supports the conclusion that the optimal labeling has been found. For
the first problem, the optimum of the SDP relaxation is 35.338, while the SVM
optimum for the predicted labels is 35.341. For the second problem, those optima
2. There is a catch in the comparison of both optima: the SDP method does not use an
offset parameter b, whereas SVMlight does include such an offset. The numbers reported
are the weight vector norms when including an offset, hence favoring SVMlight .
7.4 Empirical Results 131
Figure 7.1 The result of the basic SDP relaxation (left) and of SVMlight (right) on an
artificially constructed transduction problem. The ’o’ and ’x’ signs represent the negatively
and positively labeled training points. The other data points are labeled by the algorithms.
The contour lines are drawn for the SVM as trained on the complete set of data points
with labels as determined by the transduction algorithms. The SDP relaxation yields the
desired result, while apparently SVMlight got stuck in a local optimum.
We conducted a few experiments on the constitution data set used in (De Bie and
Cristianini, 2004c). This data set contains 780 articles, an equal number in German,
French, Italian, and English, that are translations of each other. Furthermore, the
articles are organized in so-called Titles. In our experiments, we solved two different
problems: one is the classification of English + French texts versus Italian + German
texts, and the other is the classification of the largest Title (roughly containing half
of all articles) versus the smaller Titles. We tested the SDP relaxation as well as
SVMlight on both problems for different training set sizes, and plot the results in
figure 7.4. The kernel used is the normalized bag of words kernel, and d = 4.
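A normalized bag-of-words kernel is commonly computed as the cosine similarity between word-count vectors. The sketch below follows that common definition; the lowercasing, the whitespace tokenization, and the function names are assumptions, since the text does not specify the preprocessing.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Word counts after lowercasing and whitespace tokenization (assumed)."""
    return Counter(text.lower().split())


def bow_kernel(a, b):
    """Inner product of two word-count vectors."""
    return sum(count * b[word] for word, count in a.items())


def normalized_bow_kernel(text_a, text_b):
    """k(a, b) / sqrt(k(a, a) * k(b, b)), so that k(a, a) = 1."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    return bow_kernel(a, b) / math.sqrt(bow_kernel(a, a) * bow_kernel(b, b))
```

The normalization removes the effect of document length, so that long and short articles become comparable.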
Figure 7.2 The result of the basic SDP relaxation on an artificially constructed trans-
duction problem (left), and the result of SVMlight (right). Here we organized the data
points in a few small clusters. In each of the clusters, one or two samples are labeled (in
total there are six training points). For both methods, the training label determines the
test labels of all data points within the cluster, as is desirable in most applications.
Figure 7.3 SDP transduction (left) and SVMlight (right) are applied to the same data
set as in figure 7.2, now with a different labeling of the training points. If we label the
data points according to the labeled point in the cluster they (visually) belong to, this
transduction problem is slightly unbalanced: one class of 38 points, the other of 42 points.
Since SVMlight fixes the fraction of positively and negatively labeled data points to their
fraction in the training set (by default), two data points are split off the upper-left
cluster to satisfy this constraint.
7.5 Summary and Outlook 133
Figure 7.4 The receiver operating characteristics (ROC) score evaluated on the test
set, as a function of the size of the training set for both classification problems. The
bold lines are for the easy classification problem classifying languages, and the faint
lines are for the harder classification problem classifying articles according to their
“Title”. The performance of the approximated SDP relaxation is shown in solid lines,
the SVMlight performance in dotted lines. Bars indicate the standard deviation over three
randomizations.
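The ROC score is not defined further in the caption; a common reading is the area under the ROC curve, which equals the fraction of (positive, negative) pairs that the classifier ranks correctly. The sketch below computes it under that assumption, with ties counted as one half.

```python
def roc_score(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 0.5 corresponds to a random ranking and 1.0 to a perfect one, which matches the range of the vertical axis in the figure.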
While the empirical results for the relaxation are generally better than with
SVMlight , unfortunately the scalability is still limited. To solve this problem, we
introduced an approximation technique of general applicability in relaxations of
combinatorial problems. The performance of this approximation strongly depends
on the quality of the approximation, and mixed empirical results in comparison
with SVMlight are reported.
Future work includes investigating whether the problem structure can be ex-
ploited to speed up the optimization problem. An important theoretical question
that remains unanswered is whether the relaxation allows finding a solution with a
margin that is provably within a fixed constant factor of the unrelaxed optimum.
As we pointed out, the relaxation does provide us with an interval within which
the true optimal solution must lie. However, the size of this interval is not known
a priori as is the case for, e.g., the relaxation of the max-cut problem (see e.g.
(Helmberg, 2000)). Lastly, it would be interesting to investigate theoretically what
the influence is of the subspace approximation on the optimum.
We state the Schur complement lemma without proof (see e.g. (Helmberg, 2000)):
When the matrix A may be rank deficient, the following extended Schur complement
lemma should be used. It is a generalization of the standard Schur complement
lemma. We provide it here with a proof:

Lemma 7.6 (Extended Schur complement lemma) For symmetric matrices
A ⪰ 0 and C ⪰ 0:
\[
\left.
\begin{array}{l}
\text{The column space of } B \perp \text{the null space of } A \\
C \succeq B^T A^{\dagger} B
\end{array}
\right\}
\Leftrightarrow
\begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \succeq 0.
\]
where $V_0$ denotes the singular vectors for the null space of A, V the other singular
vectors, and Λ is a diagonal matrix containing the non-zero singular values of A,
i.e. Λ ≻ 0. The blocks are assumed to be compatible. Similarly, we write the SVD
of C as
\[
C = \begin{pmatrix} W & W_0 \end{pmatrix}
    \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix}
    \begin{pmatrix} W & W_0 \end{pmatrix}^T
  = W \Delta W^T.
\]
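$\begin{pmatrix} \Lambda & B_V \\ B_V^T & C \end{pmatrix} \succeq 0$. Left multiplication of both sides of this inequality with
$\begin{pmatrix} V & 0 \\ 0 & I \end{pmatrix}$ and on the right with its transpose, yields
$\begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \succeq 0$.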
(⇐) We will prove the orthogonality of the column space of B with the null space
of A by contradiction. So, assume that the column space of B is not orthogonal to
the null space $V_0$ of A. Then, there exists a vector $v_0$ in the span of $V_0$ for which
$B^T v_0 = b \neq 0$. Now, we have that
\[
\begin{pmatrix} A & B \\ B^T & C \end{pmatrix}
= \begin{pmatrix} V \Lambda V^T & B \\ B^T & C \end{pmatrix} \succeq 0.
\]
Thus, for any vector w, multiplying this matrix with $\begin{pmatrix} v_0 \\ w \end{pmatrix}$ on the right and on the left with
its transpose must result in a non-negative number: $2 b^T w + w^T C w \ge 0$. However,
plugging in $w = -C^{\dagger} b - W_0 W_0^T b$ yields
$2 b^T w + w^T C w = -2 b^T W_0 W_0^T b - b^T C^{\dagger} b < 0$,
and thus we reached a contradiction. So we have established that the column
space of B is orthogonal to the span of $V_0$.
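This means that we can write B as $B = V B_V$ for some particular $B_V$, and
\[
\begin{pmatrix} A & B \\ B^T & C \end{pmatrix}
= \begin{pmatrix} V \Lambda V^T & V B_V \\ B_V^T V^T & C \end{pmatrix}
= \begin{pmatrix} V & 0 \\ 0 & I \end{pmatrix}
  \begin{pmatrix} \Lambda & B_V \\ B_V^T & C \end{pmatrix}
  \begin{pmatrix} V & 0 \\ 0 & I \end{pmatrix}^T,
\]
so that:
\[
\begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \succeq 0
\;\Rightarrow\;
\begin{pmatrix} \Lambda & B_V \\ B_V^T & C \end{pmatrix} \succeq 0.
\]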
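The statement of the extended Schur complement lemma can be checked numerically on a minimal rank-deficient example. The sketch below fixes A = diag(1, 0), B = (b1, b2)^T and a scalar C = c, all chosen only for illustration. For this choice the null space of A is spanned by the second unit vector, so the orthogonality condition reads b2 = 0, and the pseudo-inverse condition reads c ≥ b1². Positive semi-definiteness is tested through principal minors, which is adequate for such tiny matrices.

```python
from itertools import combinations


def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def is_psd(m, tol=1e-12):
    """A symmetric matrix is PSD iff all principal minors are non-negative."""
    n = len(m)
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            sub = [[m[i][j] for j in idx] for i in idx]
            if det(sub) < -tol:
                return False
    return True


def block(b1, b2, c):
    """[[A, B], [B^T, C]] with rank-deficient A = diag(1, 0), B = (b1, b2)^T."""
    return [[1.0, 0.0, b1], [0.0, 0.0, b2], [b1, b2, c]]


both_conditions_hold = is_psd(block(0.5, 0.0, 0.25))   # b2 = 0 and c = b1^2
c_too_small = is_psd(block(0.5, 0.0, 0.2))             # c < b1^2
not_orthogonal = is_psd(block(0.0, 0.1, 5.0))          # b2 != 0
```

Note that in the last case no value of c, however large, restores positive semi-definiteness, which is exactly why the orthogonality condition is needed in addition to the inequality on C.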
8.1 Introduction
Figure 8.1 (Left) The graphical representation of a generative model. (Right) The
graphical representation of a discriminative model.
Figure 8.2 Graphical models for semi-supervised data in the generative framework (left)
and the discriminative framework (right).
Contrast this with the situation in the discriminative model (right panel of
figure 8.2). In this case the d-separation criterion shows that θ is independent of
xj for the unlabeled data. The observed value of xj will not have an effect on the
posterior distribution of θ when yj is unobserved.
In the remaining sections of this chapter we will show how the discriminative
model can be augmented to allow it to handle unlabeled data.
In figure 8.2 (right panel) we saw how, in the discriminative approach, an unlabeled
data point fails to influence the posterior for θ and thus the position of the decision
boundary. This is because the unlabeled points and the parameters become d-
separated (independent from one another) when the label yj is unobserved. To
restore the dependence we need to augment the model. This can be done by
introducing an additional variable zj that is a child of yj and is always observed. As
shown in figure 8.3 this breaks the d-separation of xj and θ and allows probabilistic
dependence to flow between these variables—even when yj is unobserved.
labeled is different in the two classes; i.e., if p (zj = 1|yj = 1) = p (zj = 1|yj = −1),
then we have p (zj ) = p (zj |yj ) and zj is effectively decoupled from yj , once again
d-separating θ from xj .
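The decoupling can be made concrete with Bayes' rule for a single point. In the sketch below, z = 1 is taken to mean that the label is missing, and the prior and the missingness probabilities are arbitrary example values. When both classes are unlabeled at the same rate, observing z leaves the belief about y unchanged; when the rates differ, z carries information about y.

```python
def posterior_positive(prior_pos, p_missing_given_pos, p_missing_given_neg, z):
    """p(y = 1 | z) by Bayes' rule for the binary missing-label indicator z."""
    like_pos = p_missing_given_pos if z == 1 else 1.0 - p_missing_given_pos
    like_neg = p_missing_given_neg if z == 1 else 1.0 - p_missing_given_neg
    joint_pos = prior_pos * like_pos
    joint_neg = (1.0 - prior_pos) * like_neg
    return joint_pos / (joint_pos + joint_neg)


# Equal missingness rates: the posterior equals the prior.
same_rate = posterior_positive(0.3, 0.6, 0.6, z=1)
# Different rates: observing z = 1 shifts the belief about y.
different_rate = posterior_positive(0.3, 0.9, 0.1, z=1)
```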
On the other hand, there is no need to restrict ourselves to binary indicator
variables; we can be more clever about the augmentation. The remainder of the
chapter develops the specific augmentation that we propose. As will be seen, our
proposal is similar in spirit to the transductive SVM (see chapter 6); we want to
place the decision boundary in a region of low data density. The assumption that
the interclass regions have lower data density is known as the cluster assumption
(see chapter 1). We will show how an augmented model can capture the spirit of
the cluster assumption—but without implementing an explicit density model.
Our approach is based on the notion of a null-category, a class for which we never
observe any data. The null-category can be viewed as a probabilistic interpretation
of the “margin” in the SVM.1
To simplify our discussion of the null-category noise model, we first introduce
a latent process variable fi . This variable will allow us to discuss the noise model
independently of the “process model.” The latent variable allows the probability of
class membership to decompose as
\[ p(y_i | x_i) = \int p(y_i | f_i) \, p(f_i | x_i) \, df_i, \]
where we refer to p (yi |fi ) as the noise model and p (fi |xi ) as the process model.
The null-category noise model derives from the general class of ordered categorical
models (Agresti, 2002). In the specific context of binary classification we will
consider an ordered categorical model containing three categories:
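\[
p(y_i | f_i) =
\begin{cases}
\phi\left(-\left(f_i + \tfrac{a}{2}\right)\right) & \text{for } y_i = -1 \\
\phi\left(f_i + \tfrac{a}{2}\right) - \phi\left(f_i - \tfrac{a}{2}\right) & \text{for } y_i = 0 \\
\phi\left(f_i - \tfrac{a}{2}\right) & \text{for } y_i = 1
\end{cases},
\]
where $\phi(x) = \int_{-\infty}^{x} N(z | 0, 1) \, dz$ is the cumulative Gaussian distribution function
and a is a parameter giving the width of category yi = 0 (see figure 8.4).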
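The three-category model is easy to evaluate with the error function. The sketch below is a direct transcription of the three cases; the function names and the default width a = 1 are our own choices for illustration.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def noise_model(y, f, a=1.0):
    """p(y | f) for y in {-1, 0, 1}; a is the width of the null category."""
    if y == -1:
        return phi(-(f + a / 2.0))
    if y == 0:
        return phi(f + a / 2.0) - phi(f - a / 2.0)
    if y == 1:
        return phi(f - a / 2.0)
    raise ValueError("y must be -1, 0, or 1")


# The three probabilities form a proper distribution for any latent value f.
total_at_f = sum(noise_model(y, 0.7) for y in (-1, 0, 1))
```

Because φ(−x) = 1 − φ(x), the three terms always sum to one, and the middle category is most probable when f lies near zero.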
1. We are not the first to consider a probabilistic interpretation of the SVM loss function.
Sollich (1999, 2000) treats the margin in terms of a “not sure” class, but this interpretation
suffers from problems of normalization.
142 Gaussian Processes and the Null-Category Noise Model
Figure 8.4 The ordered categorical noise model. The plot shows p (yi |fi ) for different
values of yi . Here we have assumed three categories.
We can also express this model in an equivalent and simpler form by replacing
the cumulative Gaussian distribution by a Heaviside step function,
\[
H(x) =
\begin{cases}
0 & \text{if } x < 0 \\
1 & \text{if } x > 0
\end{cases},
\]
where we have standardized the width parameter to 1, by assuming that the overall
scale is also handled by the process model.
in other words, a data point cannot be from the category yi = 0 and be unlabeled.
We then parameterize the probabilities of missing labels for the other classes as
p (zi = 1|yi = 1) = γ+ and p (zi = 1|yi = −1) = γ− .
For points where the label is present the latent process is updated as usual
(because zi is d-separated from θ by yi ). When the data point’s label is missing,
8.3 Process Model and Effect of the Null-Category 143
By marginalizing across yi when the label is missing and otherwise using the
standard likelihood, we recover the “effective likelihood function” for a single data
point, L (fi ). It takes one of three forms:
\[
L(f_i) =
\begin{cases}
H\left(-\left(f_i + \tfrac{1}{2}\right)\right) & \text{for } y_i = -1,\ z_i = 0 \\
\gamma_- H\left(-\left(f_i + \tfrac{1}{2}\right)\right) + \gamma_+ H\left(f_i - \tfrac{1}{2}\right) & \text{for } z_i = 1 \\
H\left(f_i - \tfrac{1}{2}\right) & \text{for } y_i = 1,\ z_i = 0
\end{cases}.
\]
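The effective likelihood can be written as a short function. In this sketch a missing label (z = 1) is represented by passing y=None, and the value of the step function at exactly zero is set to 1; both are conventions chosen for the illustration, as are the default values of γ− and γ+.

```python
def heaviside(x):
    # The value at exactly x = 0 is a convention; 1 is assumed here.
    return 1.0 if x >= 0 else 0.0


def effective_likelihood(f, y=None, gamma_minus=0.5, gamma_plus=0.5):
    """L(f) for a single point; y=None means the label is missing (z = 1)."""
    if y is None:
        return (gamma_minus * heaviside(-(f + 0.5))
                + gamma_plus * heaviside(f - 0.5))
    if y == -1:
        return heaviside(-(f + 0.5))
    if y == 1:
        return heaviside(f - 0.5)
    raise ValueError("y must be -1, 1, or None")


# An unlabeled point assigns zero likelihood to latent values inside the
# null-category region (-1/2, 1/2).
inside = effective_likelihood(0.0)
outside = effective_likelihood(1.0, gamma_plus=0.7)
```

For an unlabeled point the likelihood is therefore bimodal: it pushes the latent value away from the region around the decision boundary without saying to which side.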
The constraint imposed by (8.1) implies that an unlabeled data point never
comes from the class yi = 0. Since yi = 0 lies between the labeled classes this
is equivalent to a hard assumption that no data come from the region around
the decision boundary. We can also soften this hard assumption, if so desired, by
injection of noise into the process model. If we also assume that our labeled data
only come from the classes yi = 1 and yi = −1 we will never obtain any evidence
for data with yi = 0; for this reason we refer to this category as the null-category
and the overall model as a null-category noise model (NCNM).
The noise model we have described can be used within a range of optimization
frameworks. Indeed, viewing the noise model as a probabilistic interpretation of
the SVM’s margin, if we specify
\[ f_i = w^T x_i, \qquad p(w) = N(w | 0, I), \]
and let z have a multivariate Gaussian distribution with mean m and covariance
Σ,
\[
N(z | m, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
\exp\left(-\frac{1}{2} (z - m)^T \Sigma^{-1} (z - m)\right),
\]
then the maximum a posteriori (MAP) solution for w is given by the linear SVM
algorithm. Naturally fi can then be “kernelized” and the MAP solution for the
model becomes equivalent to the nonlinear SVM. However, in this domain the
meaning of a prior distribution over w is not entirely clear, and it is generally more
convenient to consider a process prior over fi . As is well known, the process prior
144 Gaussian Processes and the Null-Category Noise Model
which leads to the SVM as a MAP solution is the Gaussian process prior (for two
useful reviews of Gaussian processes see O’Hagan (1992); Williams (1998)). Under
the Gaussian process prior the values {fi } are jointly distributed as a zero-mean
Gaussian distribution with covariance given by the kernel matrix K.
In the remainder of this chapter we will consider the use of a Gaussian process prior
over fi . The algorithms we consider update the process posterior in a sequential
manner, incorporating a single data point at a time. It is therefore sufficient to
consider a univariate distribution over fi given xi , of the form
\[ p(f_i | x_i) = N(f_i | \mu(x_i), \varsigma(x_i)), \]
where the mean μ (xi ) and the variance ς (xi ) are functions of the covariate xi .
A natural consideration in this setting is the effect of our likelihood function on
the distribution over fi when incorporating a new data point. As we have already
mentioned, if we observe yi , then the parameters are d-separated from zi . In this
case the effect of the likelihood on the posterior process will be similar to that
incurred in binary classification, in that the posterior will be a convolution of the
step function and a Gaussian distribution. However, when the data point’s label is
missing the effect will depend on the mean and variance of p (fi |xi ). If this Gaussian
has little mass in the null-category region (i.e., the region between the classes), the
posterior will be similar to the prior. However, if the Gaussian has significant mass
in the null-category region, the outcome may be loosely described in two ways:
1. If p (fi |xi ) “spans the likelihood,” figure 8.5 (left), then the mass of the posterior
can be apportioned to either side of the null-category region, leading to a bimodal
posterior. The variance of the posterior will be greater than the variance of the prior,
a consequence of the fact that the effective likelihood function is not log-concave
(as can be easily verified).
2. If p (fi |xi ) is “rectified by the likelihood,” figure 8.5 (right), then the mass of the
posterior will be pushed into one side of the null-category and the variance of the
posterior will be smaller than the variance of the prior.
Note that for all situations in which a portion of the mass of the prior distribution
falls within the null-category region it is pushed out to one side or both sides. The
intuition behind the two situations is that in case 1, it is not clear what label the
data point has, but it is clear that it shouldn’t be where it currently is (in the
null-category). The result is that the process variance increases. In case 2 the data
point is being assigned a label and the decision boundary is pushed to one side of
the point so that it is classified according to the assigned label.
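The two situations can be checked numerically. The following Python sketch multiplies a Gaussian prior over fi by the effective likelihood on a grid and computes the posterior mean and variance, which are also the two moments that a Gaussian approximation by moment matching would retain. The function names, the grid settings, the choice γ+ = γ− = 0.5, and the convention that the step function is zero at its jump are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def effective_likelihood(f, y=None, gamma_pos=0.5, gamma_neg=0.5):
    """Effective likelihood L(f); y in {-1, +1} for a labeled point, y=None if unlabeled."""
    f = np.asarray(f, dtype=float)
    below = (f < -0.5).astype(float)  # H(-(f + 1/2)), taken as 0 at the jump (assumption)
    above = (f > 0.5).astype(float)   # H(f - 1/2)
    if y == 1:
        return above
    if y == -1:
        return below
    return gamma_neg * below + gamma_pos * above

def posterior_moments(mu, var, y=None, n_grid=200001):
    """Mean and variance of N(f | mu, var) * L(f), normalized on a uniform grid."""
    sd = np.sqrt(var)
    f = np.linspace(mu - 10.0 * sd, mu + 10.0 * sd, n_grid)
    w = np.exp(-0.5 * (f - mu) ** 2 / var) * effective_likelihood(f, y)
    w /= w.sum()
    mean = np.sum(w * f)
    return mean, np.sum(w * (f - mean) ** 2)

# Case 1: the prior spans the null-category (mean 0, variance 1).
mean_span, var_span = posterior_moments(0.0, 1.0)
# Case 2: the prior sits mostly on one side (mean 1, variance 0.25).
mean_rect, var_rect = posterior_moments(1.0, 0.25)
```

Under these settings the first posterior has a larger variance than its prior and the second a smaller one, in line with cases 1 and 2 above.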
In figure 8.6, we demonstrate the effect of the null-category. We sampled a vector
$(f_i)_{i=1}^{500}$ from a Gaussian process with a radial basis function (RBF) kernel. The
covariates $(x_i)_{i=1}^{500}$ were sampled uniformly from the two-dimensional unit square.
8.4 Posterior Inference and Prediction 145
Figure 8.5 Two situations of interest. Diagrams show the prior distribution over fi
(long dashes), the effective likelihood function from the noise model when zi = 1 (short
dashes), and a schematic of the resulting posterior over fi (solid line). (Left) The posterior
is bimodal and has a larger variance than the prior. (Right) The posterior has one dominant
mode and a lower variance than the prior. In both cases the process is pushed away from
the null-category.
Figure 8.6 Samples from a standard Gaussian process classifier with a probit noise
model (left) and a Gaussian process with the null-category noise model (right). The
covariate vectors were originally sampled uniformly from the unit square. The null-
category noise model has the effect of reducing the data density in the region of the
decision boundary.
In the left panel, points were assigned to the class of yi = 1 with probability
φ (fi ) and were otherwise assigned a class of yi = −1. In the right panel they were
assigned the class yi = 1 with probability $\phi(f_i - \tfrac{1}{2})$ and the class yi = −1 with
probability $\phi(-f_i - \tfrac{1}{2})$; all other points were assumed to have come from the
null-category and were removed. Note that this rejection of points has the effect of
reducing the data density near the decision boundary.
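The sampling procedure for the right panel can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code used to produce the figure: the kernel length scale, the random seed, and the function names are assumptions, and the Gaussian process draw uses an eigendecomposition with clipped eigenvalues purely for numerical robustness.

```python
import numpy as np
from math import erf, sqrt

def probit(t):
    """Cumulative standard Gaussian, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t) / sqrt(2.0)))

def sample_ncnm(n=500, lengthscale=0.2, seed=0):
    """Draw covariates, a GP sample, and labels; drop null-category points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, 2))                      # uniform on the unit square
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * sq_dist / lengthscale ** 2)     # RBF kernel (length scale assumed)
    eigval, eigvec = np.linalg.eigh(K)
    root = np.sqrt(np.clip(eigval, 0.0, None))        # clip tiny negative eigenvalues
    f = eigvec @ (root * rng.standard_normal(n))      # f ~ N(0, K)
    p_pos = probit(f - 0.5)                           # P(y = +1)
    p_neg = probit(-f - 0.5)                          # P(y = -1)
    u = rng.uniform(size=n)
    y = np.where(u < p_pos, 1, np.where(u < p_pos + p_neg, -1, 0))
    keep = y != 0                                     # y = 0 is the null-category
    return X[keep], y[keep], int(n - keep.sum())

X_kept, y_kept, n_removed = sample_ncnm()
```

The removed points are those with fi close to zero, which is why the surviving sample thins out near the decision boundary.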
Broadly speaking, the effects discussed above are independent of the process model:
the effective likelihood will always force the latent function away from the null-
category. To implement our model, however, we must choose a specific process model
and inference method. The nature of the noise model means that it is unlikely that
we will find a nontrivial process model for which inference (in terms of marginalizing
fi ) will be tractable. We therefore turn to approximations which are inspired by
“assumed density filtering” (ADF) methods; see, e.g., Csató (2002). The idea in
ADF is to approximate the (generally non-Gaussian) posterior with a Gaussian by
matching the moments between the approximation and the true posterior. ADF
has also been extended to allow each approximation to be revisited and improved
as the posterior distribution evolves (Minka, 2001).
One further complication is that the “effective likelihood” associated with the
null-category noise model is not log-concave. The implication of this is that the
variance of the posterior process can increase when a point is included. This
situation is depicted in figure 8.5 (left); the posterior depicted in this plot has
a larger variance than the prior distribution. This increase in variance is difficult to
accommodate within the ADF approximation framework and in our implementation
it was ignored.
One important advantage of the Gaussian process framework is that it is
amenable to an empirical Bayesian treatment: the hyperparameters in the
covariance function can be learned by optimizing the marginal likelihood. In practice,
however, the process variance controls the scale of the function, and if it is
allowed to grow in an unconstrained manner the effective width of the null-category
region is driven to zero. The model then becomes equivalent to a standard binary
classification noise model, and the unlabeled data points no longer have any effect.
To prevent this from happening we regularize by imposing an L1 penalty on the
process variances (this is equivalent to placing an exponential prior on those
parameters). The L1 penalty prefers smaller process variances, thereby increasing
the effective width of the null-category region. The model therefore prefers a large
null-category region. This is analogous to maximizing the margin in a support
vector machine.
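The equivalence between the L1 penalty and an exponential prior can be written out in one line. In the sketch below, v denotes a process variance and β > 0 the rate of the exponential prior; both symbols are introduced here for illustration and do not appear in the original text.

```latex
% Exponential prior on a nonnegative process variance v (beta is an assumed symbol):
p(v) = \beta \, e^{-\beta v}, \quad v \ge 0
\quad\Longrightarrow\quad
-\ln p(v) = \beta v - \ln \beta .
% Since v >= 0, we have |v| = v, so maximizing
%   ln p(data | v) + ln p(v)
% is the same as maximizing the log marginal likelihood minus the L1 penalty beta * |v|.
```

A larger β corresponds to a stronger preference for small process variances, and hence for a wider null-category region.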
Once the parameters of the process model have been learned, we wish to make
predictions about a new test-point x∗ via the marginal distribution p (y∗ |x∗ ). For
the NCNM an issue arises here: this distribution will have a non-zero probability of
y∗ = 0, a label that does not exist in either our labeled or unlabeled data. This is
where the role of z∗ becomes essential. The new point also has z∗ = 1 so in reality
the probability that a data point is from the positive class is given by
8.5 Results
Sparse representations of the training data are essential for speeding up the process
of learning. We made use of the informative vector machine (IVM) approach in
which the data are sparsified via a sequential greedy method in which points are
placed in an active set according to information-theoretic criteria. This approach
provides an approximation to a full Gaussian process classifier which is competitive
with the SVM in terms of speed and accuracy. The IVM also enables efficient
learning of kernel hyperparameters, and we made use of this feature in all of our
experiments.
Figure 8.7 Results from the toy problem. There are 400 points, which have probability
0.1 of receiving a label. Labeled data points are shown as circles and crosses. Data points
in the active set are shown as large dots. All other data points are shown as small dots.
(Left) Learning on the labeled data with the IVM algorithm. All labeled points are used
in the active set. (Right) Learning on the labeled and unlabeled data with the NCNM.
There are 100 points in the active set. In both plots decision boundaries are shown as
a solid line; dotted lines represent contours within 0.5 of the decision boundary (for the
NCNM this is the edge of the null-category).
where δnm is the Kronecker delta function. The parameters of the kernel were
learned by performing type II maximum likelihood over the active set. Since active
set selection causes the marginalized likelihood to fluctuate it cannot be used to
monitor convergence; we therefore simply iterated fifteen times between active set
selection and kernel parameter optimization. The parameters of the noise model,
{γ+ , γ− }, can also be optimized, but note that if we constrain γ+ = γ− = γ, then
the likelihood is maximized by setting γ to the proportion of the training set that
is unlabeled.
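The claim about γ follows from a short calculation, sketched here under the assumption that each labeled point contributes a factor p(zi = 0|yi) = 1 − γ and each unlabeled point a factor γ. The counts n_u (unlabeled) and n_l (labeled) are symbols introduced for this illustration.

```latex
% Part of the log likelihood that depends on gamma when gamma_+ = gamma_- = gamma:
\ell(\gamma) = n_u \ln \gamma + n_l \ln (1 - \gamma) .
% Setting the derivative to zero:
\frac{d\ell}{d\gamma} = \frac{n_u}{\gamma} - \frac{n_l}{1 - \gamma} = 0
\quad\Longrightarrow\quad
\hat{\gamma} = \frac{n_u}{n_u + n_l} .
```

For the unlabeled points the common factor γ multiplies both step functions in the effective likelihood, so it separates from the dependence on fi.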
We first considered an illustrative toy problem to demonstrate the capabilities
of our model. We generated two-dimensional data in which two class-conditional
densities interlock. There were 400 points in the original data set. Each point was
assigned a label with probability 0.1, leading to 37 labeled points. First a standard
IVM classifier was trained on the labeled data only (figure 8.7, left). We then used
the null-category approach to train a classifier that incorporates the unlabeled data.
As shown in figure 8.7 (right), the resulting decision boundary finds a region of low
data density and more accurately reflects the underlying data distribution.
We next considered the null-category noise model for learning of the USPS hand-
written digit data set. This data set is fully labeled, but we can ignore a proportion
of the labels and treat the data set as a semi-supervised task. In the experiments
that followed we used an RBF kernel with a linear component. We ran each experi-
ment ten times, randomly selecting the data points that were labeled. The fraction
of labeled points, r, was varied between 0.01 and 0.25. Each digit was treated as a
separate “one against the others” binary classification task. We also summarized
these binary classification tasks with an overall error rate by allocating each test
data point to the class with the highest probability. In the first of our experiments,
we attempted to learn the parameters of the kernel by maximizing the IVM’s ap-
proximation to the marginal likelihood. The results are summarized in table 8.1.
As can be seen in the table, good classification results are obtained for values of
r above 0.1, but poor results are obtained for values of r below 0.1. This appears
troublesome at first sight, given that many semi-supervised learning algorithms give
reasonable performance even when the proportion of labeled data is as low as
0.1. It must be borne in mind, however, that the algorithm presented here faces the
additional burden of learning the kernel hyperparameters. Most other approaches
do not have this capability and therefore results are typically reported for a given,
tuned set of kernel parameters. To make a more direct comparison we also undertook
experiments in which the kernel hyperparameters were fixed to the values found by
an IVM trained on the fully labeled data set. These results are summarized in
Table 8.1 Table of results for semi-supervised learning on the USPS digit data.
r        0            1             2            3            4
0.010    18 ± 0.0     8.0 ± 6.5     9.9 ± 0.0    8.3 ± 0.0    10 ± 0.0
0.025    11 ± 8.8     0.98 ± 0.1    9.9 ± 0.0    6.5 ± 2.4    10 ± 0.0
0.050    1.7 ± 0.2    1.0 ± 0.1     3.7 ± 0.4    5.4 ± 2.7    7.4 ± 3.5
0.10     1.7 ± 0.1    0.95 ± 0.1    3.2 ± 0.2    3.2 ± 0.3    3.3 ± 0.3
0.25     1.6 ± 0.2    0.97 ± 0.1    2.5 ± 0.2    2.9 ± 0.2    2.8 ± 0.1

r        5            6            7            8            9            Overall
0.010    8.0 ± 0.0    8.5 ± 0.0    7.3 ± 0.0    8.3 ± 0.0    8.8 ± 0.0    83 ± 7.3
0.025    8.0 ± 0.0    8.5 ± 0.0    7.3 ± 0.0    8.3 ± 0.0    8.8 ± 0.0    64 ± 5.0
0.050    7.1 ± 1.9    1.7 ± 0.2    7.3 ± 0.0    7.4 ± 1.9    7.6 ± 2.7    33 ± 7.2
0.10     3.0 ± 0.3    1.5 ± 0.1    1.3 ± 0.1    3.4 ± 0.2    2.0 ± 0.3    7.7 ± 0.2
0.25     2.4 ± 0.2    1.3 ± 0.2    1.2 ± 0.1    2.6 ± 0.3    1.6 ± 0.2    6.4 ± 0.2
For these results the model learned the kernel parameters. We give the results for the individual
binary classification tasks and the overall error computed from the combined classifiers. Each
result is summarized by the mean and standard deviation of the percent classification error across
ten runs with different random seeds.
table 8.2. As expected these results are much better in the range where r < 0.1.
With the exception of the digit 2, a sensible decision boundary was learned for at
least one of the runs even when r = 0.01.
8.6 Discussion
Table 8.2 Table of results for semi-supervised learning on the USPS digit data.
r        0            1            2            3            4
0.010    3.2 ± 5.2    13 ± 14      9.9 ± 0.0    3.1 ± 0.2    8.3 ± 2.6
0.025    1.5 ± 0.2    1.5 ± 0.9    5.2 ± 2.0    2.9 ± 0.2    4.4 ± 2.1
0.050    1.5 ± 0.2    1.2 ± 0.2    3.4 ± 0.4    2.9 ± 0.1    3.3 ± 0.2
0.10     1.5 ± 0.1    1.2 ± 0.1    2.8 ± 0.2    2.8 ± 0.2    3.0 ± 0.2
0.25     1.4 ± 0.2    1.3 ± 0.2    2.4 ± 0.2    2.6 ± 0.2    2.8 ± 0.2

r        5            6            7            8            9            Overall
0.010    7.5 ± 1.0    7.7 ± 8.5    12 ± 17      7.5 ± 1.2    35 ± 23      42 ± 10
0.025    5.0 ± 1.3    1.6 ± 0.2    1.9 ± 1.9    4.3 ± 0.5    9.9 ± 8.5    14 ± 6.1
0.050    3.6 ± 0.6    1.5 ± 0.1    1.3 ± 0.1    4.1 ± 0.4    2.6 ± 1.3    8.4 ± 0.7
0.10     2.8 ± 0.2    1.3 ± 0.1    1.3 ± 0.1    3.5 ± 0.3    2.0 ± 0.2    7.2 ± 0.5
0.25     2.3 ± 0.2    1.2 ± 0.1    1.2 ± 0.1    2.7 ± 0.2    1.6 ± 0.2    6.1 ± 0.4
For these results the model was given the kernel parameters learned by the IVM on the standard
fully labeled data. We give the results for the individual binary classification tasks and the overall
error computed from the combined classifiers.
9.1 Introduction
predefined patterns.
In the probabilistic framework, semi-supervised induction is a missing data
problem, which can be addressed by generative methods such as mixture models
thanks to the EM algorithm and extensions thereof (McLachlan, 1992). Generative
models apply to the joint density of patterns x and class y. They have appealing
features, but they also have major drawbacks. First, the modeling effort is much
more demanding than for discriminative methods, since the model of p(x, y) is
necessarily more complex than the model of P (y|x). Being more precise, the
generative model is also more likely to be misspecified. Second, the fitness measure
is not discriminative, so that better models are not necessarily better predictors of
class labels. These issues are addressed in chapters 2 and 4.
These difficulties have led to proposals where unlabeled data are processed
by supervised classification algorithms. Here, we describe an estimation principle
applicable to any probabilistic classifier, aiming at making the most of unlabeled
data when they should be beneficial to the learning process, that is, when classes are
well apart. The method enables control of the contribution of unlabeled examples,
thus providing robustness with respect to the violation of the postulated low-density
separation assumption.
Section 9.2 motivates the estimation criterion. It is followed by the description
of the optimization algorithms in section 9.3. The connections with some other
principles or algorithms are then detailed in section 9.4. Finally, the experiments of
section 9.5 offer a test bed to evaluate the behavior of entropy regularization, with
comparisons to generative models and manifold learning.
In this section, we first show that unlabeled data do not contribute to the maximum-
likelihood estimation of discriminative models. The belief that “unlabeled data
should be informative” should then be encoded as a prior to modify the estimation
process. We argue that assuming low entropy for P (y|x) is a sensible encoding
of this belief, and finally we describe the learning criterion derived from this
assumption.
9.2.1 Likelihood
is independent from the missing class information. Let h be the random variable
encoding missingness: h = 1 if y is hidden and h = 0 if y is observed. The
missing at random assumption reads
\[ P(h | x, y) = P(h | x). \tag{9.1} \]
This assumption excludes cases where missingness may indicate a preference for a
particular class (this can happen, for example, in opinion polls where the “refuse
to answer” option may hide an inclination toward a shameful answer). Assuming
independent examples, the conditional log likelihood is then
\[ L(\theta; L_n) = \sum_{i=1}^{l} \ln P(y_i | x_i; \theta) + \sum_{i=1}^{n} \ln P(h_i | x_i). \tag{9.2} \]
Theory provides little support for the abundant experimental evidence showing
that unlabeled examples can help the learning process. Learning theory is mostly
developed at the two extremes of the statistical paradigm: in parametric statistics
where examples are known to be generated from a known class of distribution,
and in the distribution-free structural risk minimization (SRM) or probably ap-
proximately correct (PAC) frameworks. Semi-supervised induction does not fit the
distribution-free frameworks: no positive statement can be made without distribu-
tional assumptions, as for some distributions p(x, y), unlabeled data are noninfor-
mative while supervised learning is an easy task. In this regard, generalizing from
labeled and unlabeled data may differ from transductive inference.
In parametric statistics, theory has shown the benefit of unlabeled examples,
either for specific distributions (O’Neill, 1978), or for mixtures of the form p(x) =
πp(x|y = 1)+(1−π)p(x|y = 2), where the estimation problem is essentially reduced
to the one of estimating the mixture parameter π (Castelli and Cover, 1996). These
studies conclude that the (asymptotic) information content of unlabeled examples
154 Entropy Regularization
There are many possible measures of class overlap. We chose Shannon’s conditional
entropy, which is invariant to the parameterization of the model, but the framework
developed below could be applied to other measures of class overlap, such as Renyi
entropies. Note, however, that the algorithms detailed in section 9.3.1 are specific to
this choice. Obviously, the conditional entropy may only be related to the usefulness
of unlabeled data where labeling is indeed ambiguous. Hence, the measure of class
overlap should be conditioned on missingness:
1. This statement, given explicitly by O’Neill (1978), is also formalized, though not
stressed, by Castelli and Cover (1996), where the Fisher information for unlabeled
examples at the estimate π̂ is clearly a measure of the overlap between class-conditional
densities: $I_u(\hat{\pi}) = \int \frac{(p(x|y=1) - p(x|y=2))^2}{\hat{\pi} p(x|y=1) + (1 - \hat{\pi}) p(x|y=2)} \, dx$.
2. Here, maximum entropy refers to the construction principle which enables derivation
of distributions from constraints, not to the content of priors regarding entropy.
9.3 Optimization Algorithms 155
The MAP estimate is defined as the maximizer of the posterior distribution, that
is, the maximizer of
where the constant terms in the log likelihood (9.2) and log prior (9.4) have been
dropped.
While L(θ; Ln ) is only sensitive to labeled data, Hemp (y|x, h = 1; Ln ) is only
affected by the value of P (m|x; θ) on unlabeled data. Since these two components of
the learning criterion are concave in P (m|x; θ), their weighted difference is usually
not concave, except for λ = 0. Hence, the optimization surface is expected to
possess local maxima, which are likely to be more numerous as u and λ grow. Semi-
supervised induction is halfway between classification and clustering; hence, the
progressive loss of concavity in the shift from supervised to unsupervised learning
is not surprising, as most clustering optimization problems are nonconvex (Rose
et al., 1990).
The empirical approximation Hemp (9.5) of H (9.3) breaks down for wiggly
functions P (m|·) with abrupt changes between data points (where p(x) is bounded
from below). As a result, it is important to constrain P (m|·) in order to enforce the
closeness of the two functionals. In the following experimental section, we imposed
such a constraint on P (m|·) by adding a smoothness penalty to the criterion C
(9.7). Note that this penalty also provides a means to control the capacity of the
classifier.
which distributes the probability mass according to the current estimated posterior
P (m|·) (for labeled examples, the assignment is clamped at the original label
gm (xi ; θ) = δmyi ). For 0 < λ ≤ 1, the Gibbs distribution is more peaked than
the estimated posterior. One recovers EM for λ = 0, and the hard assignments of
classification EM (CEM) (Celeux and Govaert, 1992) correspond to λ = 1.
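The effect of λ on the E step can be illustrated with a small Python sketch. It assumes that the Gibbs distribution raises each estimated posterior P(m|xi ; θ) to the power 1/(1 − λ) and renormalizes over the M classes, which is the multiclass analogue of the binary expression used for logistic regression in this section; the function name and the handling of λ = 1 as a hard argmax are choices made for illustration.

```python
import numpy as np

def tempered_assignments(P, lam):
    """E step sketch: g_m proportional to P(m|x)^(1/(1-lam)), for 0 <= lam <= 1.

    P is an (n, M) array whose rows are estimated class posteriors.
    """
    P = np.asarray(P, dtype=float)
    if lam >= 1.0:
        # Limit lam = 1: hard assignment to the most probable class (as in CEM).
        G = np.zeros_like(P)
        G[np.arange(len(P)), P.argmax(axis=1)] = 1.0
        return G
    # Work in the log domain for numerical stability.
    logits = np.log(np.clip(P, 1e-300, None)) / (1.0 - lam)
    logits -= logits.max(axis=1, keepdims=True)
    G = np.exp(logits)
    return G / G.sum(axis=1, keepdims=True)

P_example = np.array([[0.6, 0.3, 0.1]])
g_em = tempered_assignments(P_example, 0.0)    # unchanged posterior (plain EM)
g_mid = tempered_assignments(P_example, 0.5)   # sharper than the posterior
g_cem = tempered_assignments(P_example, 1.0)   # one-hot assignment
```

For labeled examples the assignment would instead be clamped to the observed label, as stated above.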
The M step then consists in maximizing the expected log likelihood with respect
to θ,
\[ \theta^{s+1} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{m=1}^{M} g_m(x_i; \theta^s) \ln P(m | x_i; \theta), \tag{9.8} \]
where the expectation is taken with respect to the distribution (g1 (·; θs ), . . . , gM (·; θs )),
and θs is the current estimate of θ.
The optimization problem (9.8) is concave in P (m|x; θ) and also in θ for logistic
regression models. Hence it can be solved by a second-order optimization algorithm,
such as the Newton-Raphson algorithm, which is often referred to as iteratively
reweighted least squares, or IRLS in statistical textbooks (Friedman et al., 2000).
We omit the detailed derivation of IRLS, and provide only the update equation
for θ in the standard logistic regression model for binary classification problems. 3
The model of posterior distribution is defined as
\[ P(1 | x; \theta) = \frac{1}{1 + e^{-(w^\top x + b)}}, \tag{9.9} \]
where θ = (w, b). In the binary classification problem, the M-step (9.8) reduces to
\[ \theta^{s+1} = \arg\max_{\theta} \sum_{i=1}^{n} \Big[ g_1(x_i; \theta^s) \ln P(1 | x_i; \theta) + (1 - g_1(x_i; \theta^s)) \ln (1 - P(1 | x_i; \theta)) \Big], \]
where
\[ g_1(x_i; \theta) = \frac{P(1 | x_i; \theta)^{\frac{1}{1-\lambda}}}{P(1 | x_i; \theta)^{\frac{1}{1-\lambda}} + (1 - P(1 | x_i; \theta))^{\frac{1}{1-\lambda}}} \]
for unlabeled data and g1 (xi ; θ) = δ1yi for labeled examples. Let pθ and g denote
the vector of P (1|xi ; θ) and g1 (xi ; θs ) values respectively, X the (n × (d + 1)) matrix
of xi values concatenated with the vector 1, and Wθ the (n × n) diagonal matrix
with ith diagonal entry P (1|xi ; θ)(1 − P (1|xi ; θ)). The Newton-Raphson update is
\[ \theta \leftarrow \theta + \left( X^\top W_\theta X \right)^{-1} X^\top (g - p_\theta). \tag{9.10} \]
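The E step and the Newton-Raphson update can be combined into a short loop. The Python sketch below is a minimal illustration for a fixed λ < 1, not the authors' code. It assumes labels coded as 1 (class 1) and 0 (the other class), and it adds a weight-decay term with strength `alpha` (applied to the bias as well, for simplicity) so that the linear system stays well conditioned; with `alpha = 0` the step reduces to the update (9.10). All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_step(theta, X, g, alpha):
    """One Newton-Raphson step; alpha = 0 recovers update (9.10)."""
    p = sigmoid(X @ theta)                      # vector of P(1 | x_i; theta)
    W = p * (1.0 - p)                           # diagonal of W_theta
    hessian = X.T @ (W[:, None] * X) + alpha * np.eye(X.shape[1])
    gradient = X.T @ (g - p) - alpha * theta
    return theta + np.linalg.solve(hessian, gradient)

def da_em_logistic(x_lab, y_lab, x_unl, lam, alpha=1.0, n_outer=20, n_newton=5):
    """Alternate tempered E steps and Newton M steps for a fixed lam in [0, 1)."""
    x_all = np.vstack([x_lab, x_unl])
    X = np.hstack([x_all, np.ones((len(x_all), 1))])   # append the constant 1
    n_lab = len(x_lab)
    theta = np.zeros(X.shape[1])
    power = 1.0 / (1.0 - lam)
    for _ in range(n_outer):
        # E step: sharpen the current posterior on unlabeled points.
        p = sigmoid(X @ theta)
        g = p ** power / (p ** power + (1.0 - p) ** power)
        g[:n_lab] = y_lab                               # clamp labeled examples
        # M step: a few Newton steps on the expected log likelihood.
        for _ in range(n_newton):
            theta = newton_step(theta, X, g, alpha)
    return theta
```

A deterministic annealing schedule would call such a routine repeatedly while increasing λ, warm-starting θ from the previous solution; that outer schedule is omitted here.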
Depending on how P (y|x) is modeled, the M step (9.8) may not be concave, and
other gradient-based optimization algorithms should be used. Even in the case
where a logistic regression model is used, conjugate gradient may turn out to be
computationally more efficient than the IRLS procedure. Indeed, even if each M step
of the deterministic annealing EM algorithm consists in solving a convex problem,
this problem is nonquadratic. IRLS solves exactly each quadratic subproblem, a
strategy which becomes computationally expensive for high-dimensional data or
kernelized logistic regression. The approximate solution provided by a few steps of
conjugate gradient may turn out to be more efficient, especially since the solution
θs+1 returned at the sth M step is not required to be accurate.
Depending on whether memory is an issue or not, conjugate gradient updates
may use the optimal steps computed from the Hessian, or approximations returned
by a line search. These alternatives have experimentally been shown to be much
more efficient than IRLS on large problems (Komarek and Moore, 2003).
Finally, when EM does not provide a useful decomposition of the learning task,
one can directly address the maximization of the learning criterion (9.7) with
conjugate gradient, or other gradient-based algorithms. Here also, it is useful to
define an annealing scheme, where λ is gradually increased from 0 to 1, in order to
avoid poor local maxima of the optimization surface.
Minimum entropy regularizers have been used in other contexts to encode learn-
ability priors (Brand, 1999). In a sense, Hemp can be seen as a poor man’s way to
generalize this approach to continuous input spaces. This empirical functional was
also used as a criterion to learn scale parameters in the context of transductive man-
ifold learning (Zhu et al., 2003b). During learning, discrepancies between H (9.3)
and Hemp (9.5) are prevented to avoid hard unstable solutions by smoothing the
estimate of posterior probabilities.
9.4.3 Self-Training
Maximal margin separators are theoretically well-founded models which have shown
great success in supervised classification. For linearly separable data, they have been
shown to be a limiting case of probabilistic hyperplane separators (Tong and Koller,
2000).
In the framework of transductive learning, Vapnik (1998) proposed broaden-
ing the margin definition to unlabeled examples, by taking the smallest Euclidean
distance between any (labeled and unlabeled) training point to the classification
boundary. The following theorem, whose proof is given in the appendix, general-
izes theorem 5, corollary 6 of Tong and Koller (2000) to the margin defined in
Theorem 9.1 Consider the two-class linear classification problem with linearly
separable labeled examples, where the classifier is obtained by optimizing
$P(1 | x; (w, b)) = 1/(1 + e^{-(w^\top x + b)})$ with the semi-supervised minimum entropy
criterion (9.7), under the constraint that ||w|| ≤ B. The margin of that linear classifier
converges toward the maximum possible margin among all such linear classifiers,
as the bound B goes to infinity.
Hence, the minimum entropy solution can approach semi-supervised SVM (Vap-
nik, 1998; Bennett and Demiriz, 1999). We, however, recall that the MAP criterion
is not concave in P (m|x; θ), so that the convergence toward the global maximum
cannot be guaranteed with the algorithms presented in section 9.3. This problem is
shared by all inductive semi-supervised algorithms dealing with a large number of
unlabeled data in reasonable time, such as mixture models or the transductive SVM
of Joachims (1999). Explicitly or implicitly, inductive semi-supervised algorithms
impute labels which are somehow consistent with the decision rule returned by the
learning algorithm. The enumeration of all possible configurations is only avoided
thanks to a heuristic process, such as deterministic annealing, which may fail.
Most graph-based transduction algorithms avoid this enumeration problem be-
cause their labeling process is not required to comply with a parameterized deci-
sion rule. This clear computational advantage has, however, its counterpart: label
propagation is performed via a user-defined similarity measure. The selection of
a discriminant similarity measure is thus left to the user, or to an outer loop, in
which case the overall optimization process is not convex anymore. The experimen-
tal section below illustrates that the choice of discriminant similarity measures is
difficult in high-dimensional spaces, and when a priori similar patterns should be
discriminated.
9.5 Experiments
4. That is, the margin on an unlabeled example is defined as the absolute value of the
margin on a labeled example at the same location.
The former shows what has been gained by handling unlabeled data, and the latter
provides the “crystal ball” ultimate performance obtained by guessing correctly all
labels. All hyperparameters (weight-decay for all logistic regression models plus the
λ parameter (9.7) for minimum entropy) are tuned by tenfold cross-validation.
These discriminative methods are compared to generative models. Throughout
all experiments, a two-component Gaussian mixture model was fitted by the EM
algorithm (two means and one common covariance matrix estimated by maximum
likelihood on labeled and unlabeled examples (McLachlan, 1992)). The problem
of local maxima in the likelihood surface is artificially avoided by initializing
EM with the parameters of the true distribution when the latter is truly a two-
component Gaussian mixture, or with maximum likelihood parameters on the
(fully labeled) test sample when the distribution departs from the model. This
initialization advantages EM, which is guaranteed to pick, among all local maxima
of the likelihood, the one which is in the basin of attraction of the optimal
value. In particular, this initialization prevents interferences that may result from
the “pseudolabels” given to unlabeled examples at the first E step. The “label
switching” problem (badly labeled clusters) is prevented at this stage.
Figure 9.1 (Left): Test error of minimum entropy logistic regression (◦) and mixture
models (+) versus Bayes error rate for u/l = 10. The errors of logistic regression (dashed),
and logistic regression with all labels known (dash-dotted) are shown for reference. (Right):
Relative improvement to logistic regression versus Bayes error rate.
Figure 9.2 Test error versus u/l ratio for 5 % Bayes error (a = 0.23). Test errors of
minimum entropy logistic regression (◦) and mixture models (+). The errors of logistic
regression (dashed), and logistic regression with all labels known (dash-dotted) are shown
for reference.
outliers. For each class, the examples are generated from a mixture of two Gaussians
centered on the same mean: a unit variance component gathers 98 % of examples,
while the remaining 2 % are generated from a large variance component, where
each variable has a standard deviation of 10. The mixture model used by EM is
now slightly misspecified since the whole distribution is still modeled by a simple
two-component Gaussian mixture. The results, displayed in the left-hand side of
figure 9.3, should be compared with figure 9.2. The generative model dramatically
suffers from the misspecification and behaves worse than logistic regression for all
sample sizes. The unlabeled examples have first a beneficial effect on test error, then
have a detrimental effect when they overwhelm the number of labeled examples.
On the other hand, the discriminative models behave smoothly as in the previous
case, and the minimum entropy criterion performance steadily improves with the
addition of unlabeled examples.
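The outlier-contaminated class-conditional distribution described above can be sampled as follows. This Python sketch is only an illustration of the stated recipe (98 % unit-variance examples, 2 % examples with standard deviation 10 per variable, both components sharing the class mean). The class means placed at (a, ..., a) and (−a, ..., −a) in 50 dimensions, the sample sizes, and the function name are assumptions introduced here.

```python
import numpy as np

def sample_class_with_outliers(n, mean, rng, outlier_frac=0.02, outlier_sd=10.0):
    """Draw n examples from a two-component Gaussian mixture sharing one mean."""
    mean = np.asarray(mean, dtype=float)
    is_outlier = rng.uniform(size=n) < outlier_frac        # about 2 % of examples
    sd = np.where(is_outlier, outlier_sd, 1.0)[:, None]    # per-example standard deviation
    x = mean + sd * rng.standard_normal((n, mean.size))
    return x, is_outlier

rng = np.random.default_rng(0)
a, d = 0.23, 50                                            # class means at +/- a (assumed layout)
x_pos, out_pos = sample_class_with_outliers(1000, a * np.ones(d), rng)
x_neg, out_neg = sample_class_with_outliers(1000, -a * np.ones(d), rng)
```

A two-component Gaussian mixture with a common covariance matrix cannot represent the heavy-tailed component, which is the misspecification this experiment probes.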
The last series of experiments illustrates the robustness with respect to the
cluster assumption, by which the decision boundary should be placed in low-
density regions. The samples are drawn from a distribution such that unlabeled
data do not convey information, and where a low-density p(x) does not indicate
class separation. This distribution is modeled by two Gaussian clusters, as in the
first series of experiments, but labeling is now independent from clustering: example
xi belongs to class 1 if xi2 > xi1 and belongs to class 2 otherwise; the Bayes
decision boundary now separates each cluster in its middle. The mixture model is
unchanged. It is now far from the model used to generate data. The right-hand side
plot of figure 9.3 shows that the favorable initialization of EM does not prevent
the model from being fooled by unlabeled data: its test error steadily increases
with the amount of unlabeled data. Conversely, the discriminative models behave
well, and the minimum entropy algorithm is not distracted by the two clusters; its
performance is nearly identical to the one of training with labeled data only (cross-
164 Entropy Regularization
[Figure 9.3: two panels, each plotting test error against the ratio u/l (1, 3, 10, 30, 100).]
Figure 9.3 Test error versus u/l ratio for a = 0.23. Average test errors for minimum
entropy logistic regression (◦) and mixture models (+). The test error rates of logistic
regression (dotted), and logistic regression with all labels known (dash-dotted) are shown
for reference. (Left): Experiment with outliers. (Right): Experiment with uninformative
unlabeled data.
Table 9.1 Error rates (%) of minimum entropy (ME) versus consistency method (CM),
for a = 0.23, l = 50, and (a) pure Gaussian clusters, (b) Gaussian clusters corrupted by
outliers, and (c) class boundary separating one Gaussian cluster
validation provides λ values close to zero), which can be regarded as the ultimate
achievable performance in this situation.
above minimum entropy, and which does not show any sign of improvement as the
sample of unlabeled data grows. In particular, when classes do not correspond to
clusters, the consistency method performs random class assignments.
In fact, the experimental setup, which was designed for the comparison of global
classifiers, is not favorable to manifold methods, since the input data are truly
50-dimensional. In this situation, finding a discriminant similarity measure may
require numerous degrees of freedom, and the consistency method provides only
one tuning parameter: the scale parameter σ 2 . Hence, these results illustrate that
manifold learning requires more tuning efforts for truly high-dimensional data, and
some recent techniques may respond to this need (Sindhwani et al., 2005).
9.6 Conclusion
Theorem 9.1 Consider the two-class linear classification problem with linearly
separable labeled examples, where the classifier is obtained by optimizing
$P(1|x; (w, b)) = 1/(1 + e^{-(w^\top x + b)})$ with the semi-supervised minimum entropy
criterion (9.7), under the constraint that $\|w\| \le B$. The margin of that linear classifier
converges toward the maximum possible margin among all such linear classifiers,
as the bound B goes to infinity.
Proof Consider the logistic regression model P (1|x; θ) parameterized by θ =
(w, b). Let zi ∈ {−1, +1} be a binary variable defined as follows: if xi is a positive
labeled example, zi = +1; if xi is a negative labeled example, zi = −1; if xi is an
unlabeled example, zi = sign(P (1|xi ; θ) − 1/2). The margin for the ith labeled or
where the indices [1, l] and [l + 1, n] correspond to labeled and unlabeled data,
respectively.
On the one hand, for all θ such that there exists an example with nonpositive
margin, the cost (9.11) is trivially upper-bounded by − ln(2) if the example is
labeled and −λ ln(2) otherwise. On the other hand, by the linear separability
assumption, there exists θ = (w, b) with, say, ||w|| = 1 such that mi > 0. Consider
now the cost obtained with the admissible solution Bθ as B → +∞. In this limit,
since mi (Bθ) = Bmi (θ), all the terms of the finite sum (9.11) converge to zero, so
that the value of the cost converges to its maximum value (lim B→+∞ C(Bθ) = 0).
Hence, in the limit of B → +∞ all margins of the maximizer of C are positive.
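The scaling argument can be checked numerically. The sketch below assumes that the cost (9.11), which is not reproduced in this excerpt, is the labeled log-likelihood minus λ times the entropy of the unlabeled predictions, both written in terms of the margins; the function name `cost` and the toy margin values are illustrative only.

```python
import numpy as np

def cost(margins_labeled, margins_unlabeled, lam):
    """Assumed form of (9.11): sum of log sigmoid(m_i) over labeled points
    minus lam times the binary entropy of sigmoid(m_i) over unlabeled points."""
    ml = np.asarray(margins_labeled, dtype=float)
    mu = np.asarray(margins_unlabeled, dtype=float)
    log_lik = -np.logaddexp(0.0, -ml).sum()          # log sigmoid(m), stable
    p = 1.0 / (1.0 + np.exp(-mu))
    # binary entropy of p = sigmoid(m): H = log(1 + e^{-m}) + (1 - p) * m
    entropy = (np.logaddexp(0.0, -mu) + (1.0 - p) * mu).sum()
    return log_lik - lam * entropy

# Hypothetical positive margins for a separating theta with ||w|| = 1.
m_lab = np.array([0.5, 1.0])
m_unl = np.array([0.3, 0.8])
for B in (1, 5, 20, 80):
    # m_i(B theta) = B m_i(theta): the cost rises toward its supremum 0
    print(B, cost(B * m_lab, B * m_unl, lam=0.5))
```

With any nonpositive labeled margin the same function stays below −ln 2, which is the bound used in the first half of the argument.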
We now show that the maximizer of C achieves the largest minimal margin. The
cost (9.11) is simplified by using the following equivalence relations when B → +∞:
Let us write $m^* > 0$ the minimum margin among the labeled examples and
$m_* > 0$ the minimum margin among the unlabeled examples, $N^*$ the number
of minimum margin labeled examples (with $m_i = m^*$), and $N_*$ the number of
minimum margin unlabeled examples (with $m_i = m_*$). As $e^{-Bm_i} = o(e^{-Bm^*})$
when $m_i > m^*$, we obtain
$$C(B\theta) = -N^* e^{-Bm^*} + o(e^{-Bm^*}) - \lambda N_* B m_* e^{-Bm_*} + o(B m_* e^{-Bm_*}).$$
Now we note that if $m^* < m_*$, then $B m_* e^{-Bm_*} = o(e^{-Bm^*})$, and that if $m^* \ge m_*$
then $e^{-Bm^*} = o(B m_* e^{-Bm_*})$. Hence, depending on whether $m^* < m_*$ or $m^* \ge m_*$
we either obtain
$$C(B\theta) = -N^* e^{-Bm^*} + o(e^{-Bm^*}) \tag{9.12}$$
or
$$C(B\theta) = -\lambda N_* B m_* e^{-Bm_*} + o(B m_* e^{-Bm_*}). \tag{9.13}$$
Now, consider two different values of θ, θ1 and θ2 , giving rise to minimum margins
M1 and M2 respectively, with M1 > M2 . The solution Bθ1 will be preferred to Bθ2
if C(Bθ1 )/C(Bθ2 ) < 1. From (9.12) and (9.13), we see that it does not matter
whether Mi is attained on a labeled or an unlabeled example, but only whether M1 > M2
or M2 > M1 . In all cases C(Bθ1 )/C(Bθ2 ) → 0 when M1 > M2 . This allows
the conclusion that as B → ∞, the global maximum of C(Bθ) over θ tends to a
maximum margin solution, where the minimum margin M (over both labeled and
unlabeled examples) is maximized.
10 Data-Dependent Regularization
10.1 Introduction
A substantial number of algorithms and methods exist for solving supervised learn-
ing problems with little or no assumptions about the distribution generating the
samples. Semi-supervised learning methods, in contrast, have to rely on assump-
tions about the problem so as to relate the available unlabeled data to possible
class decisions. The most common such assumption is the cluster assumption (see
chapter 1, or (Seeger, 2000b)) that, loosely speaking, prefers class decisions that
cut between rather than through clusters of unlabeled points. The effect of the
assumption is that it can significantly reduce the set of possible (reasonable) deci-
sions that need to be considered in response to a few labeled examples. The same
effect can also be achieved through representational constraints (e.g., (Blum and
Mitchell, 1998)).
The definition of what constitutes a cluster and how the cluster assumption is
formalized varies from one method to another. For example, clusters may be defined
in terms of a weighted graph so that class decisions correspond to a graph partition
(Szummer and Jaakkola, 2001; Blum and Chawla, 2001; Blum et al., 2004). In a
regularization setting, the graph may be used to introduce a smoothness penalty on
the discriminant function so as to limit how the discriminant function can change
within graph neighborhoods (e.g., see chapter 12). Alternatively, we may define
a model for each cluster via generative mixture models, and associate a single
class decision (distribution over classes) with each mixture component (e.g., see
chapter 3).
The strength of the bias from unlabeled data can be directly controlled via the
regularization parameter or by weighting likelihoods corresponding to labeled and
unlabeled data. The choice of the weight may have a substantial effect on the
resulting classifier, however (e.g., (Corduneanu and Jaakkola, 2002)).
We approach here the semi-supervised learning problem as a regularization
problem, consistent with the broader cluster assumption, but define the regularization
penalty by appealing to information theory. The key idea is to express the penalty as
a bit cost of deviating decisions from those consistent with some assumed structure
over the unlabeled examples. In our case the structure corresponds to a collection of
overlapping sets or regions that play a role similar to clusters; decisions are biased
to be the same within each set and their specification is tied to the marginal distri-
bution over the examples. In practice, the sets can be derived from weighted graph
neighborhoods for discrete objects or from ε-balls covering the unlabeled points.
We begin by introducing the overall information regularization principle. The
structure of the remaining sections is modeled after figure 10.1, successively
elaborating the principle under variations in the example space, type of unlabeled data
that is available, and which modeling assumptions we are willing to make.
Consider a typical semi-supervised learning problem with a few labeled examples
((x1 , y1 ), . . . , (xl , yl )) and a large number of unlabeled examples (xl+1 , . . . , xn ) or
the marginal distribution p(x). We assume that the labels are discrete taking values
in Y = {1, . . . , M } for some finite M . The goal is to estimate the conditional dis-
tributions Q(y|x) associated with each available example x (labeled or unlabeled).
We will introduce the information regularization approach here from two alter-
native perspectives: smoothness and communication. By smoothness we mean con-
straining how Q(y|x) is allowed to vary from one point to another. The smoothness
preference is expressed as a regularization penalty over different choices of Q(·|x),
x ∈ X. The communication perspective, on the other hand, characterizes the reg-
ularization penalty in terms of the cost of encoding labels for all the points using
Q(y|x) relative to a basic coding scheme.
In either case the key role is played by a collection of regions, denoted by $\mathcal{R}$. Each
region $R \in \mathcal{R}$ represents a set of a priori equivalent examples. In other words, in the
absence of any other information, we would prefer to associate the same distribution
of labels with all x ∈ R. Figure 10.2 illustrates two possible overlapping regions.
We will use these regions to exemplify the basic ideas.
[Figure 10.1: tree of information regularization settings. Labels recovered from the
figure: metric estimation; relational estimation; full marginal; finite sample;
parametric inductive; unrestricted inductive; parametric transductive; unrestricted
transductive (Szummer and Jaakkola, NIPS 2002; Corduneanu and Jaakkola, NIPS 2004);
parametric transductive (Corduneanu and Jaakkola, submitted).]
Consider the six unlabeled examples in region R in figure 10.2. We assume that
each point has the same probability of being a member of the region so that
P(x|R) = 1/6. The membership probabilities provide an additional degree of free-
dom for specifying smoothness constraints. Given the region R and the membership
probabilities P(x|R), x ∈ R, we would like to introduce a penalty for any varia-
tion in the conditionals Q(y|x) across the examples in the region. A natural choice
for this penalty is the Kullback-Leibler (KL) divergence between each conditional
Q(y|x) and the best common choice Q(y|R):
$$I_R(x; y) = \min_{Q(\cdot|R)} \sum_{x \in R} \sum_{y \in \mathcal{Y}} P(x|R)\, Q(y|x) \log \frac{Q(y|x)}{Q(y|R)} \tag{10.1}$$
$$= \sum_{x \in R} \sum_{y \in \mathcal{Y}} P(x|R)\, Q(y|x) \log \frac{Q(y|x)}{Q(y|R)}, \tag{10.2}$$
where $Q(y|R) = \sum_{x \in R} P(x|R)\, Q(y|x)$.¹ Note that we can interpret the result as
the mutual information between x and y within the region so long as the joint
distribution Q(x, y) is defined as Q(y|x)P(x|R). The mutual information involves
no prior penalty on what the common distribution should be; $I_R(x; y)$ is zero if all
the points in the region are labeled y = 1 or all of them have entirely uncertain
conditionals Q(y|x) = 1/M .
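As a small illustration of the regional penalty, the sketch below evaluates $I_R(x; y)$ for a single region directly from the definition with $Q(y|R) = \sum_x P(x|R) Q(y|x)$. The function name `region_information` and the toy conditionals are illustrative assumptions, not part of the original text.

```python
import numpy as np

def region_information(Q, membership):
    """I_R(x; y) for one region.
    Q: array (n_points, M), row j holds Q(.|x_j).
    membership: P(x_j|R) for the same points, summing to 1."""
    Q = np.asarray(Q, dtype=float)
    w = np.asarray(membership, dtype=float)
    Q_R = w @ Q                                   # common distribution Q(y|R)
    # entries with Q(y|x) = 0 contribute nothing (0 log 0 = 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(Q > 0, np.log(Q / Q_R), 0.0)
    return float(np.sum(w[:, None] * Q * log_ratio))

w = np.full(6, 1.0 / 6.0)                         # six points, P(x|R) = 1/6
same = np.tile([0.3, 0.7], (6, 1))                # identical conditionals
split = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
print(region_information(same, w))                # no variation, no penalty
print(region_information(split, w))               # three points per class
```

Identical conditionals give a zero penalty whatever the common distribution is, while an even split between two certain labels gives log 2, the largest value attainable with two classes.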
Suppose now that some of the six examples in region R have been labeled. We
will formulate the resulting estimation task as a regularization problem with the
mutual information serving as a regularization penalty. To this end, let Q refer
collectively to the parameters Q(·|x), x ∈ R. Define J(Q) = IR (x; y) (which we
will extend shortly to multiple regions) so that the penalized maximum-likelihood
criterion is given by
$$\sum_{i=1}^{l} \log Q(y_i|x_i) - \lambda J(Q),$$
where λ is a regularization parameter that balances the fit to the available labeled
points and the smoothness bias expressed by J(Q). If only one of the six points is
labeled, all the points in the region will be labeled with the observed label. This
is because the value of the regularizer is independent of the common choice within
the region but biases any differences within the region. In case of two distinctly
labeled points, the remaining points would be labeled such that the conditionals
Q(y|x) assign all their weight equally to the two observed labels while excluding all
others. The conditionals associated with the labeled points would be drawn toward
their respective labels, also excluding other than observed label values.
Multiple Regions In the single-region case the labels for unlabeled points
are pulled equally toward the optimized common distribution without further
distinguishing between the points. The notion of locality arises from multiple
regions, such as R = {R, R′ } in the figure. In this setting, the overall regularization
1. IR (x; y) is exactly the general Jensen-Shannon divergence between Q(·|x) for all x ∈ R,
weighted by P(x|R)
where γ(R) represents the weight of region R; the choice of γ(R) is a modeling
decision. γ(R) expresses a priori belief in the relative importance of the regions, thus
it is not necessarily related to $P(R) = \int_R p(x)\,dx$, the probability of region R derived
from the generative distribution of the data.
In figure 10.2 there are three sets of equivalent points that are not further
distinguished in this regularizer. They are R \ R′ , R ∩ R′ , and R′ \ R. We call
these sets that are not further partitioned by other regions atomic regions. By
introducing more regions, we partition the space into smaller atomic regions and
thus can make finer distinctions between the conditional distributions associated
with the points; within each atomic region, the conditional distributions can differ
only if some of the points are explicitly labeled.
A sequence of overlapping regions can mediate influence between the conditionals
associated with more “remote” points, those that do not appear in a common region.
For example, labeling any point in R \ R′ will also set all the labels in R′ \ R via
the intersection. Note, however, that labeling the points in the intersection would
not completely remove this influence; the Markov properties associated with the
regions pertain to the conditional distributions, not labels directly.
The choice of the regions, region weights γ(R), and the membership probabilities
P (x|R) will change the regularizer. While these provide additional degrees of
freedom that have to be set (or learned), there are nevertheless simple ways of
specifying them directly based on the problem. For example, suppose we are given
a weighted undirected graph with vertex set V , edge set E, and edge weights w(u, v)
associated with any (u, v) ∈ E. Then we can simply associate the regions with edges,
specify equal membership probabilities for vertices in each edge, and set γ(R) equal
to the weight of the corresponding edge in the graph. The resulting regularizer is
analogous to the graph-based regularizers for discriminant functions except that it
is cast in terms of conditional probabilities.
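The graph construction can be sketched as follows, assuming one region per edge, membership probabilities of 1/2 for the two endpoints, and γ(R) equal to the edge weight, as described above. The helper names `edge_information` and `graph_regularizer` and the three-node path are hypothetical.

```python
import numpy as np

def kl(a, b):
    """KL divergence between discrete distributions, with 0 log 0 = 0."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def edge_information(q_u, q_v):
    """I_R for the two-point region R = {u, v} with P(u|R) = P(v|R) = 1/2."""
    mid = 0.5 * (q_u + q_v)                       # Q(y|R)
    return 0.5 * kl(q_u, mid) + 0.5 * kl(q_v, mid)

def graph_regularizer(Q, edges, weights):
    """Sum over edges of w(u, v) times the regional information."""
    return sum(w * edge_information(Q[u], Q[v])
               for (u, v), w in zip(edges, weights))

# Path 0 - 1 - 2 with edge weights 1 and 2.
edges, weights = [(0, 1), (1, 2)], [1.0, 2.0]
Q = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(graph_regularizer(Q, edges, weights))
```

The penalty vanishes when all vertices share the same conditional, and heavier edges make disagreement across them more costly.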
and the region-specific coding scheme. Under these assumptions, the amount of
information that must be sent to the receiver to accurately reconstruct the samples
on average is
$$J(Q) = \sum_{R \in \mathcal{R}} \gamma(R)\, I_R(x; y),$$
which is the regularizer previously defined. Equivalently, we can rewrite the regu-
larizer as
In order to construct the regularizer we need to specify how the regions cover the
metric space along with the weights γ(R) associated with the regions. The cover
R should provide connected and significantly overlapping regions. This is necessary
since labeling one point can only affect another if they can be connected through a
path of overlapping regions.
In covering the space we have to balance the size of the regions with their
10.2 Information Regularization on Metric Spaces 175
overlap. We derive here the form of the regularizer in the limit of vanishing but
highly overlapping regions. Under mild constraints about how the limit is taken,
the resulting regularizer is the same. The limiting form has the additional benefit
that it no longer requires us to engineer a particular covering of the space.
We choose the regions such that as their size approaches 0, the overlap between
neighbors approaches 100% (this is required for smoothness). In the limit, therefore,
each point belongs to infinitely many regions, resulting in an infinite sum of local
regularizers. An appropriate choice of λ, the regularization parameter, is needed to
rescale the regularizer to take into account this increase.
In choosing the cover R care must be taken not to introduce systematic biases
into the regularizer. Assuming that X has vector space structure, we can cover it
with a homogeneous set of overlapping regions of identical shape: regions centered
at the axis-parallel lattice points spaced at distance l′. In what follows the regions
are going to be axis-parallel cubes of length l, where l is much larger than l′. Because
R covers X uniformly, we can weight the regions based on the marginal density, i.e.,
γ(R) = P(R) up to a multiplicative constant.
Assuming that l and l′ are such that l/l′ is an integer, each (nonlattice) point
belongs to $(l/l')^d$ cubic regions, where d is the dimension of the vector space. Let
R′ be the partitioning of R into atomic lattice cubes of length l′. Each region in
R is partitioned into $(l/l')^d$ disjoint atomic cubes from R′, and each atomic cube
is contained in $(l/l')^d$ overlapping regions from R. We may now rewrite the global
regularizer as a sum over the partition R′:
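$$J(p) = \lim_{l \to 0} \sum_{R \in \mathcal{R}} P(R)\, I_R(x; y) = \lim_{l \to 0} \sum_{R' \in \mathcal{R}'} P(R') \sum_{R \supseteq R'} I_R(x; y) =$$
$$(l/l')^d \lim_{l' \to 0} \sum_{R' \in \mathcal{R}'} P(R')\, I_R(x; y) = \lim_{l \to 0} (l/l')^d \cdot \int_X p(x)\, \frac{dI_R(x; y)}{dx}\, dx.$$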
Note that the factor in front of the integral can be factored into the regularization
parameter λ as a multiplicative constant.
Given this form of the regularizer we can argue that regions in the shape of a
cube are indeed appropriate. We start from the principle that the regularizer should
not introduce any systematic directional bias in penalizing changes in the label. If
the diameter of a region R is small enough, $p_R(x)$ is almost uniform, and p(y = 1|x)
can be approximated well by v · x + c, where v is the direction of highest variation.
In this setting we have the following result (Corduneanu and Jaakkola, 2003):
Theorem 10.1 Let R be such that diam(R) = 1. The local information regularizer
is independent of $v/\|v\|$ if and only if $\mathrm{Var}_R[\cdot]$ is a multiple of the identity.
Proof We have $F(x_0) = vv^\top$. The relevant quantity that should be independent
of $v/\|v\|$ is therefore $v^\top \mathrm{Var}_R[\cdot]\, v$. Let $v = \Phi_i/\|\Phi_i\|$, where $\Phi_i$ is an eigenvector
of $\mathrm{Var}_R[\cdot]$ of eigenvalue $\phi_i$. Then $v^\top \mathrm{Var}_R[\cdot]\, v = \phi_i$ should not depend on the
eigenvector. It follows that $\mathrm{Var}_R[\cdot]$ has equal eigenvalues, thus $\mathrm{Var}_R[\cdot] = \phi I$. The
converse is trivial.
It follows that in order to remove any directional bias, $\mathrm{Var}_R[x] \approx \mathrm{diam}(R)^2 \cdot I$, as
is the case if R is a cube or a sphere. We thus reach our final form of the information
regularizer for metric space when the marginal is fully known:
$$J(p) = \int_X p(x)\, \mathrm{tr}(F(x))\, dx \tag{10.3}$$
We would like to estimate a label confidence Q(·|x) (that is, a soft label in $[0, 1]^M$)
for every x ∈ X given the knowledge of p(x), and a labeled sample $\{(x_i, y_i)\}_{i=1 \ldots l}$.
The information regularization principle requires us to maximize the regularized
log likelihood:
$$\max_{\{Q(y|x)\,;\, x \in X,\, y \in \mathcal{Y}\}} \sum_{i=1}^{l} \log Q(y_i|x_i) - \lambda \int_X p(x)\, \mathrm{tr}(F(x))\, dx, \tag{10.4}$$
where $F(x) = E_{Q(y|x)}\left[\nabla_x \log Q(y|x) \cdot \nabla_x \log Q(y|x)^\top\right]$, and the maximization is
subject to $0 \le Q(y|x) \le 1$ and $\sum_{y \in \mathcal{Y}} Q(y|x) = 1$.
[Figure 10.3: four panels, each plotting P(y|x) against x for x between −2 and 2.]
Figure 10.3 Nonparametric conditionals that minimize the information regularizer for
various one-dimensional data densities while the label at boundary labeled points is fixed
3. Only in one dimension do the labeled points give rise to segments that can be optimized
independently.
of class membership will not suffice for our purpose. Indeed, conditionals with very
small information regularizer can still have very complex decision boundaries, of
infinite Vapnik-Chervonenkis dimension. Instead, we rely on the p-concept (Kearns
and Schapire, 1994) model of learning full conditional densities: concepts are
functions Q(y|x) : X → [0, 1]. Then the concept class is that of conditionals with
bounded information regularizer:
$$I_\gamma(p) = \left\{ Q : \int_X p(x) \sum_{y \in \mathcal{Y}} Q(y|x)\, \|\nabla_x \log Q(y|x)\|^2\, dx \le \gamma \right\}.$$
If the loss function is the log loss, finding Q̂ is equivalent to maximizing the
information regularization objective (10.4) for a specific value of λ. However, we
will present the learning bound for the square loss, as it is bounded and easier to
work with. A similar result holds for the log-loss by using the equivalence results
between the log loss and square loss presented in (Abe et al., 2001).
The question is how different Q̂ (estimated from the sample) and Qopt (estimated
from the true conditional) can be due to this approximation. Learning theoretical
results provide guarantees that given enough labeled samples the minimization of
$\hat{E}[L_Q]$ and $E_{p(x)P(y|x)}[L_Q]$ are equivalent. We say the task is learnable if with high
probability in the sample the empirical loss converges to the true loss uniformly for
all concepts as l → ∞. This guarantees that $E[L_{\hat{Q}}]$ approximates $E[L_{Q_{opt}}]$ well.
Formally,
where the probability is with respect to all samples of size l. The inequality should
hold for l polynomially large in 1/ε, 1/δ, 1/γ.
We have the following sample complexity bound on the square loss, derived in
(Corduneanu and Jaakkola, 2003):
Here $m_p(\alpha) = P\{x : p(x) \le \alpha\}$, and $c_p(\alpha)$ is the number of disconnected sets in
{x : p(x) > α}.
The quantities $m_p(\cdot)$ and $c_p(\cdot)$ characterize how difficult the classification is due
to the structure of p(x). Learning is more difficult when significant probability mass
lies in regions of small p(x) because in such regions the variation of Q(y|x) is less
constrained. Also, the larger $c_p(\cdot)$ is, the more “clusters” there are whose labels need
to be learned from labeled data. The two measures of complexity are well behaved for the useful
densities. Densities of bounded support, Laplace and Gaussian, as well as mixtures
of these, have $m_p(\alpha) < u\alpha$, where u is some constant. Mixtures of single-mode
densities have $c_p(\alpha)$ bounded by the number of mixture components.
Here p̂(x) is the empirical estimate of the true marginal. We compare two ways
of estimating p(x), the empirical approximation $\frac{1}{n} \sum_{j=1}^{n} \delta(x - x'_j)$, as well as a
Gaussian kernel density estimator. The empirical approximation leads to optimizing
over all labelings of unlabeled data. In contrast, our algorithm contains the unla-
beled information in the regularizer.
The presented information regularization criterion can be easily optimized by
gradient-ascent or Newton-type algorithms. Note that the term
$\sigma(\theta^\top x)\sigma(-\theta^\top x) = Q(1|x)Q(-1|x)$ focuses on the decision boundary. Therefore,
compared to the standard logistic regression regularizer $\|\theta\|^2$, we penalize more
decision boundaries crossing regions of high data density. Also, the term makes the
regularizer nonconvex, making optimization potentially more difficult. This level of
complexity is, however,
unavoidable by any semi-supervised algorithm for logistic regression, because the
structure of the problem introduces locally optimal decision boundaries.
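To make the role of the boundary term concrete, the sketch below assumes the logistic model $Q(y|x) = \sigma(y\,\theta^\top x)$ with y ∈ {−1, +1}. Under that assumption $\nabla_x \log Q(y|x) = y\,\sigma(-y\,\theta^\top x)\,\theta$, so that $\mathrm{tr}\,F(x) = \sigma(\theta^\top x)\sigma(-\theta^\top x)\|\theta\|^2$, and with the empirical marginal the regularizer in (10.4) becomes an average over the unlabeled points. The displayed criterion itself is not reproduced in this excerpt, so this form is a derived assumption; the function names and the two-cluster data are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fisher_trace(theta, x):
    """tr F(x) from the definition, summing over y in {-1, +1}."""
    a = theta @ x
    total = 0.0
    for y in (-1.0, 1.0):
        grad = y * sigmoid(-y * a) * theta        # gradient of log Q(y|x) in x
        total += sigmoid(y * a) * (grad @ grad)
    return total

def info_regularizer(theta, X_unlabeled):
    """Empirical-marginal regularizer: mean of tr F over the unlabeled points."""
    a = X_unlabeled @ theta
    return float((theta @ theta) * np.mean(sigmoid(a) * sigmoid(-a)))

rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
X_u = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
between = np.array([1.0, 0.0])    # boundary x1 = 0 passes between the clusters
through = np.array([0.0, 1.0])    # boundary x2 = 0 cuts through both clusters
print(info_regularizer(between, X_u), info_regularizer(through, X_u))
```

Both parameter vectors have unit norm, so a standard $\|\theta\|^2$ penalty cannot tell them apart, whereas the data-dependent term is smaller for the boundary that avoids the clusters.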
If unlabeled data are limited, we may prefer a kernel estimate $\hat{p}(x) = \frac{1}{n} \sum_{j=1}^{n} K(x, x'_j)$
to the empirical approximation, provided the regularization integral remains
tractable. In logistic regression, if the kernels are Gaussian we can make the integral
tractable by approximating $\sigma(\theta^\top x)\sigma(-\theta^\top x)$ with a degenerate Gaussian.
Either from the Laplace approximation, or the Taylor expansion
$\log(1 + e^x) \approx \log 2 + x/2 + x^2/8$, we derive the following approximation, as in
(Corduneanu and Jaakkola, 2003):
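$$\sigma(\theta^\top x)\,\sigma(-\theta^\top x) \approx \frac{1}{4} \exp\left(-\frac{1}{4} (\theta^\top x)^2\right).$$
With this approximation computing the integral of the regularizer over the kernel
centered at μ of variance τI becomes integration of a Gaussian:
$$\frac{1}{4} \exp\left(-\frac{1}{4} (\theta^\top x)^2\right) N(x; \mu, \tau I) = \frac{1}{4} \sqrt{\frac{\det \Sigma_\theta}{\det \tau I}}\, \exp\left(-\frac{\mu^\top (\tau I - \Sigma_\theta)\, \mu}{2\tau^2}\right) N\left(x; \frac{\Sigma_\theta \mu}{\tau}, \Sigma_\theta\right),$$
where $\Sigma_\theta = \left(\frac{1}{\tau} I + \frac{1}{2} \theta\theta^\top\right)^{-1} = \tau \left[ I - \frac{1}{2} \theta\theta^\top \Big/ \left(\frac{1}{\tau} + \frac{1}{2} \|\theta\|^2\right) \right]$.
After integration only the multiplicative factor remains:
$$\frac{1}{4} \left(1 + \frac{\tau}{2} \|\theta\|^2\right)^{-\frac{1}{2}} \exp\left(-\frac{1}{4}\, \frac{(\theta^\top \mu)^2}{1 + \frac{\tau}{2} \|\theta\|^2}\right).$$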
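The closed-form factor can be compared against direct numerical integration. The sketch below uses the fact that, for x ~ N(μ, τI), the projection $a = \theta^\top x$ is a scalar Gaussian with mean $\theta^\top \mu$ and variance $\tau\|\theta\|^2$, so the integral reduces to one dimension. The function names, the grid settings, and the parameter values are illustrative assumptions.

```python
import numpy as np

def closed_form(theta, mu, tau):
    """Multiplicative factor left after integrating over the Gaussian kernel."""
    s = 1.0 + 0.5 * tau * (theta @ theta)
    return 0.25 * s ** -0.5 * np.exp(-0.25 * (theta @ mu) ** 2 / s)

def numeric(theta, mu, tau, n_grid=200001, width=12.0):
    """Riemann sum of (1/4) exp(-a^2 / 4) against the density of a = theta^T x."""
    m = theta @ mu
    s = np.sqrt(tau * (theta @ theta))
    a = np.linspace(m - width * s, m + width * s, n_grid)
    density = np.exp(-0.5 * ((a - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    f = 0.25 * np.exp(-0.25 * a ** 2)
    return float(np.sum(f * density) * (a[1] - a[0]))

theta = np.array([1.0, -2.0])
mu = np.array([0.5, 0.3])
tau = 0.7
print(closed_form(theta, mu, tau), numeric(theta, mu, tau))
```

The two values agree to numerical precision, which also confirms the bracketed form of $\Sigma_\theta$ through the determinant ratio $\det \Sigma_\theta / \det \tau I = (1 + \frac{\tau}{2}\|\theta\|^2)^{-1}$.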
[Figure 10.4: error rate (0.025 to 0.06) plotted against regularization strength λ
(0 to 2.5) for information regularization (empirical), information regularization
(kernel), and standard regularization.]
Figure 10.4 Average error rates of logistic regression with and without information
regularization on 100 random selections of 5 labeled and 100 unlabeled samples from
bivariate Gaussian classes.
information about their link structure. It is natural to believe that pages that are
linked in the same manner (common parents and common children) are biased
to have similar topics even before we see any information about their content.
Similarly, all other things being equal, pages that share common words are likely
to have similar topics. In classifying gene function, genes whose protein products
interact are more likely to participate in the same process with similar function;
or in retrieving science publications, co-cited articles, or articles published in the
same journal, are likely to have similar relevance assessments.
Relational classification is not new; it has been studied extensively from a
Bayesian network perspective, as in (Taskar et al., 2002). Nevertheless, information
regularization can exploit the relational structure with minimal assumptions about
the distribution of data, even in a nonparametric, purely transductive context.
Let us begin by representing the relational constraints as a collection of regions
(sets) R, derived from observed examples (x1 , x2 , . . . , xn ), where we expect the
labels to be similar within each region. The regions here differ from the continuous
case in that they are discrete subsets of indices {1, 2, . . . , n} in the training set. It
is useful to depict the region cover as a bipartite graph with points on one side and
regions on the other, as in figure 10.5. Note that regions can also be derived from a
metric if such a metric exists. For example, we could define regions centered at each
observed data point of a certain radius. For this reason every algorithm discussed
in this section is also applicable to finite sample metric settings.
We consider a generative process over the finite sample (x1 , x2 , . . . , xn ) by
selecting a region R from R with probability γ(R), and then an observed point xi
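[Figure 10.5: bipartite graph with region nodes R1, …, Rm (weights γ(R)) on top and
data points x1, x2, …, xn−1, xn below, joined by edges labeled P(x|R); each data
point carries a conditional Q(y|x).]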
Figure 10.5 Covering of the observed samples with a set of relational regions represented
as a bipartite graph. The lower nodes are the observed data points, and the upper nodes
are the regions.
Without constraining the family of label distributions Q(y|x), the objective that
must be optimized according to the information regularization principle is
$$\max_{\{Q(y|x_i)\}_{i=1 \ldots n}} \frac{1}{l} \sum_{i=1}^{l} \log Q(y_i|x_i) - \lambda J(Q; \mathcal{R}),$$
4. In the finite sample case we use the index of the example interchangeably with the
example itself.
10.3 Information Regularization and Relational Data 185
Lemma 10.3 The relational regularization objective for λ > 0 is a strictly convex
function of the conditionals {Q(y|xi )} provided that (1) each point i ∈ {1, . . . , n}
belongs to at least one region containing at least two points, and (2) the membership
probabilities P(i|R) and γ(R) are all non-zero.
where the variational distribution $Q_R(y)$ can be chosen independently from $Q(y|x_j)$
but the unique minimum is attained when $Q_R(y) = Q(y|R) = \sum_{j \in R} P(j|R)\, Q(y|x_j)$.
We can extend the regularizer over both $\{Q(y|x_i)\}$ and $\{Q_R(y)\}$ by defining
$$J(Q, Q_R; \mathcal{R}) = \sum_{R \in \mathcal{R}} \gamma(R) \sum_{j \in R} \sum_{y \in \mathcal{Y}} P(j|R)\, Q(y|x_j) \log \frac{Q(y|x_j)}{Q_R(y)}$$
where H(Q(·|xi )) is the Shannon entropy of the conditional. While the objective
is strictly convex, the solution cannot be written in closed form and has to be
found iteratively (e.g., via Newton-Raphson or simple bracketing when the labels
are binary). A much simpler update Q(y|xi ) = δ(y, yi ), where yi is the observed
label for xi , may suffice in practice. This update results from taking the limit of
small λ and approximates the iterative solution.
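One way to read the alternating scheme is sketched below. It is an illustration under stated assumptions, not the algorithm as listed in the original: membership probabilities are taken uniform within each region, labeled points use the simplified update Q(y|xi) = δ(y, yi), and unlabeled points use the minimizer of the extended regularizer with the $Q_R$ held fixed, which is a normalized geometric average of the $Q_R$ of the regions containing the point. The function name and the chain example are hypothetical.

```python
import numpy as np

def information_regularization(regions, gamma, n, M, labels, n_iter=200):
    """regions: list of index lists; gamma: region weights;
    labels: dict mapping a labeled index to its class in {0, ..., M-1}."""
    Q = np.full((n, M), 1.0 / M)
    for i, y in labels.items():
        Q[i] = np.eye(M)[y]                       # clamp: Q(y|x_i) = delta(y, y_i)
    for _ in range(n_iter):
        # region step: Q_R(y) = sum_j P(j|R) Q(y|x_j), uniform P(j|R) assumed
        Q_R = np.array([Q[R].mean(axis=0) for R in regions])
        # point step: weighted average of log Q_R over the regions containing i
        log_sum = np.zeros((n, M))
        weight = np.zeros(n)
        for R, g, q in zip(regions, gamma, Q_R):
            w = g / len(R)                        # gamma(R) * P(i|R)
            log_sum[R] += w * np.log(np.clip(q, 1e-300, None))
            weight[R] += w
        new_Q = np.exp(log_sum / weight[:, None])
        new_Q /= new_Q.sum(axis=1, keepdims=True)
        for i in range(n):
            if i not in labels:                   # labeled points stay clamped
                Q[i] = new_Q[i]
    return Q

# Chain of five points covered by overlapping two-point regions.
regions = [[0, 1], [1, 2], [2, 3], [3, 4]]
Q = information_regularization(regions, [1.0] * 4, n=5, M=2, labels={0: 0, 4: 1})
print(np.round(Q, 3))
```

On the chain, the label of each end point spreads through the overlapping regions, and the middle point remains undecided by symmetry.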
Thus the transduction information regularization algorithm in the nonparametric
setting consists of the following steps:
J(Q, R) < γ.
[Figure 10.6: two scatter plots on the unit square, both axes running from 0 to 1.]
Figure 10.6 Clusters correctly separated by information regularization given one label
from each class.
Theorem 10.4 $\log_2 N(\gamma) \le C(\gamma) + \gamma \cdot n \cdot t(\mathcal{R}) / \min_R \gamma(R)$, where C(γ) → 1 as
γ → 0, and t(R) is a property of R that does not depend on the cardinality of R.
$$\lim_{\gamma \to 0} N(\gamma) = 2.$$
10.3.1.3 Experiments
Table 10.1 Webpage classification comparison between naive Bayes and information
regularization and semi-supervised naive Bayes + EM on text, link, and joint features
performs better than naive Bayes on all types of features, that combining text and
link features improves performance of the regularization method, and that on link
features the method performs better than the semi-supervised naive Bayes + EM.
10.4 Discussion
11.1 Introduction
Many semi-supervised learning algorithms rely on the geometry of the data induced
by both labeled and unlabeled examples to improve on supervised methods that use
only the labeled data. This geometry can be naturally represented by an empirical
graph g = (V, E) where nodes V = {1, . . . , n} represent the training data and edges
E represent similarities between them (cf. section 1.3.3). These similarities are given
by a weight matrix W: Wij is non-zero iff xi and xj are “neighbors”, i.e., the edge
(i, j) is in E (weighted by Wij ). The weight matrix W can be, for instance, the
k-nearest neighbor matrix: Wij = 1 iff xi is among the k-nearest neighbors of xj
194 Label Propagation and Quadratic Criterion
or vice versa (and is 0 otherwise). Another typical weight matrix is given by the
Gaussian kernel of width σ:
$$W_{ij} = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}. \tag{11.1}$$
Given the graph g, a simple idea for semi-supervised learning is to propagate labels
on the graph. Starting with nodes 1, 2, . . . , l labeled¹ with their known label (1 or
−1) and nodes l + 1, . . . , n labeled with 0, each node starts to propagate its label
to its neighbors, and the process is repeated until convergence.
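A minimal sketch of this idea follows, using the Gaussian weights of (11.1). Since the algorithm listing is not reproduced here, the details are assumptions: each node takes the weighted average of its neighbors' current labels (row-normalized W), and the known labels are re-imposed after every sweep. The function names and the toy points are illustrative.

```python
import numpy as np

def gaussian_weights(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), as in (11.1)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def propagate_labels(W, y_labeled, n_iter=1000):
    """Nodes 0..l-1 carry labels in {-1, +1}; the remaining nodes start at 0."""
    n, l = W.shape[0], len(y_labeled)
    P = W / W.sum(axis=1, keepdims=True)          # row-normalized weights
    Y = np.zeros(n)
    Y[:l] = y_labeled
    for _ in range(n_iter):
        Y = P @ Y                                 # average the neighbors' labels
        Y[:l] = y_labeled                         # clamp the known labels (assumed)
    return Y

# Two labeled points followed by four unlabeled points in two clusters.
X = np.array([[0.0, 0.0], [5.0, 5.0],
              [0.2, 0.1], [0.1, -0.2], [5.1, 4.9], [4.8, 5.2]])
Y = propagate_labels(gaussian_weights(X, sigma=1.0), np.array([1.0, -1.0]))
print(np.sign(Y))
```

Each unlabeled point ends up with the sign of the labeled point in its own cluster, because the weights across clusters are negligible at this scale σ.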
An algorithm of this kind has been proposed by Zhu and Ghahramani (2002),
and is described in algorithm 11.1. Estimated labels on both labeled and unlabeled
data are denoted by Ŷ = (Ŷl , Ŷu ), where Ŷl may be allowed to differ from the given
1. If there are M > 2 classes, one can label each node i with an M -dimensional vector
(one-hot for labeled samples, i.e., with 0 everywhere except a 1 at index yi = class
of xi ), and use the same algorithms in a one-versus-rest fashion. We consider here the
classification case, but extension to regression is straightforward since labels are treated
as real values.
11.2 Label Propagation on a Similarity Graph 195
The iteration step of algorithm 11.2 can be rewritten for a labeled example (i ≤ l)
$$\hat{y}_i^{(t+1)} \leftarrow \frac{\sum_j W_{ij}\, \hat{y}_j^{(t)} + \frac{1}{\mu}\, y_i}{\sum_j W_{ij} + \frac{1}{\mu} + \epsilon} \tag{11.2}$$
These two equations can be seen as a weighted average of the neighbors’ current
labels, where for labeled examples we also add the initial label (whose weight is
inversely proportional to the parameter μ). The ε parameter is a regularization
term to prevent numerical problems when the denominator becomes too small. The
convergence of this algorithm follows from the convergence of the Jacobi iteration
method for a specific linear system, and will be discussed in section 11.3.3.
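The iteration can be sketched as below. The labeled update follows (11.2); the unlabeled update is an assumption made by analogy (the same weighted average without the 1/μ terms), since that equation is not reproduced here. Self-loops are removed, which is also an assumption, and the function name and the four-node graph are illustrative.

```python
import numpy as np

def jacobi_label_propagation(W, y_labeled, mu=0.1, eps=1e-6, n_iter=500):
    """Nodes 0..l-1 are labeled; returns the estimated labels for all nodes."""
    n, l = W.shape[0], len(y_labeled)
    W = W - np.diag(np.diag(W))                   # drop self-loops (assumed)
    A = np.zeros(n)
    A[:l] = 1.0 / mu                              # weight of the initial label
    y0 = np.zeros(n)
    y0[:l] = y_labeled
    d = W.sum(axis=1)
    Y = y0.copy()
    for _ in range(n_iter):
        # labeled rows reproduce (11.2); unlabeled rows have A = 0
        Y = (W @ Y + A * y0) / (d + A + eps)
    return Y, (np.diag(d + A + eps) - W, A * y0)

# Nodes 0 (+1) and 1 (-1) are labeled; 2 is tied to 0, 3 to 1, with a weak 2-3 link.
W = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.0, 1.0, 0.1, 0.0]])
Y, (M, b) = jacobi_label_propagation(W, np.array([1.0, -1.0]))
print(Y, np.abs(M @ Y - b).max())
```

The second printed value is the residual of the linear system M Ŷ = b whose Jacobi iteration this is; it vanishes at convergence.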
Another similar label propagation algorithm was given by Zhou et al. (2004):
at each step a node i receives a contribution from its neighbors j (weighted by
the normalized weight of the edge (i, j)), and an additional small contribution
given by its initial value. This process is detailed in algorithm 11.3 below (the
name “label spreading” was inspired from the terminology used by Zhou et al.
(2004)). Compared to algorithm 11.2, it corresponds to the minimization of a
slightly different cost criterion, maybe not as intuitive: this will be studied later
in sections 11.3.2 and 11.3.3.
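The label spreading process can be sketched as follows. This is a hedged illustration, not the book's listing of algorithm 11.3: it assumes that the normalized weights are given by the matrix D^{-1/2} W D^{-1/2}, that the update is Ŷ ← α(D^{-1/2} W D^{-1/2})Ŷ + (1 − α)Ŷ(0) with 0 < α < 1, and that every node has a positive degree. The function name `label_spreading` and its default arguments are hypothetical.

```python
import numpy as np

def label_spreading(W, y0, alpha=0.9, n_iter=1000):
    """Iterate y <- alpha * L_norm @ y + (1 - alpha) * y0.

    y0 holds the known labels (+1 / -1) and 0 for unlabeled nodes.
    Returns the final labels and the normalized weight matrix L_norm.
    """
    W = np.asarray(W, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    degrees = W.sum(axis=1)              # D_ii = sum_j W_ij (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # L_norm = D^{-1/2} W D^{-1/2}
    L_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    y = y0.copy()
    for _ in range(n_iter):
        y = alpha * (L_norm @ y) + (1.0 - alpha) * y0
    return y, L_norm
```

Under these assumptions the iterates approach (1 − α)(I − α D^{-1/2} W D^{-1/2})^{-1} Ŷ(0), so the result can be checked against a direct linear solve on small graphs.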
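The proof of convergence of algorithm 11.3 is simple (Zhou et al., 2004). Here
$\mathcal{L} = D^{-1/2} W D^{-1/2}$ denotes the normalized weight matrix used in algorithm 11.3
(not to be confused with the un-normalized graph Laplacian L = D − W). The
iteration equation being $\hat{Y}^{(t+1)} \leftarrow \alpha\mathcal{L}\hat{Y}^{(t)} + (1-\alpha)\hat{Y}^{(0)}$, we have
$$\hat{Y}^{(t+1)} = (\alpha\mathcal{L})^{t+1}\,\hat{Y}^{(0)} + (1-\alpha)\sum_{i=0}^{t}(\alpha\mathcal{L})^{i}\,\hat{Y}^{(0)}.$$
The matrix $\mathcal{L}$ being similar to $P = D^{-1}W = D^{-1/2}\mathcal{L}D^{1/2}$, it has the same
eigenvalues. Since P is a stochastic matrix by construction, its eigenvalues are in
[−1, 1], and consequently the eigenvalues of $\alpha\mathcal{L}$ are in (−1, 1) (remember α < 1).
It follows that when t → ∞, $(\alpha\mathcal{L})^{t+1} \to 0$ and
$$\sum_{i=0}^{t}(\alpha\mathcal{L})^{i} \to (I - \alpha\mathcal{L})^{-1}$$
so that
$$\hat{Y}^{(t)} \to \hat{Y}^{(\infty)} = (1-\alpha)(I - \alpha\mathcal{L})^{-1}\,\hat{Y}^{(0)}. \tag{11.4}$$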
where P0|t (i|k) is the probability that we started from xi given that we arrived at
k after t steps of random walk (this probability can be computed from the pij ). xk
is then classified to 1 if P (t) (ystart = 1|k) > 0.5, and to −1 otherwise. The authors
propose two methods to estimate the class probabilities P (y = 1|i). One is based on
an iterative expectation-maximization (EM) algorithm, the other on maximizing
a margin-based criterion, which leads to a closed-form solution (Szummer and
Jaakkola, 2002b).
It turns out that this algorithm’s performance depends crucially on the hyper-
parameter t (the length of the random walk). This parameter has to be chosen
by cross-validation (if enough data are available) or heuristically (it corresponds
intuitively to the amount of propagation we allow in the graph, i.e., to the scale of
the clusters we are interested in). An alternative way of using random walks on the
graph is to assign to point xi a label depending on the probability of arriving at
a positively labeled example when performing a random walk starting from xi and
until a labeled example is found (Zhu and Ghahramani, 2002; Zhu et al., 2003b).
The length of the random walk is not constrained anymore to a fixed value t. In
the following, we will show that this probability, denoted by P (yend = 1|i), is equal
(up to a shift and scaling) to the label obtained with algorithm 11.1 (this is similar
to the proof by Zhu and Ghahramani (2002)).
When xi is a labeled example, $P(y_{end} = 1|i) = \delta_{y_i,1}$, and when it is unlabeled we
have the relation
$$P(y_{end} = 1|i) = \sum_{j=1}^{n} P(y_{end} = 1|j)\,p_{ij}, \tag{11.6}$$
with the pij computed as in (11.5). Let us consider the matrix P = D−1 W,
i.e., such that Pij = pij . We will denote ẑi = P (yend = 1|i) and Ẑ = (Ẑl , Ẑu )
the corresponding vector split into its labeled and unlabeled parts. Similarly, the
matrices D and W can be split into four parts:
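$$D = \begin{pmatrix} D_{ll} & 0 \\ 0 & D_{uu} \end{pmatrix}, \qquad W = \begin{pmatrix} W_{ll} & W_{lu} \\ W_{ul} & W_{uu} \end{pmatrix}.$$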
This allows us to rewrite the linear system (11.7) in terms of the vector of original
labels Yl as follows:
with the sign of each element yi of Ŷu giving the estimated label of xi (which is
equivalent to comparing ẑi to a 0.5 threshold).
The solution of this random walk algorithm is thus given in closed form by a linear
system, which turns out to be equivalent to iterative algorithm 11.1 (or equivalently,
algorithm 11.2 when μ → 0 and ε = 0), as we will see in section 11.3.4.
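The relation between the random walk and the propagated labels can be checked numerically. The sketch below makes several assumptions that are not spelled out in this excerpt: labeled nodes are listed first, the transition probabilities are p_ij = W_ij / Σ_k W_ik, and every unlabeled node is connected to at least one labeled node. The function names are hypothetical.

```python
import numpy as np

def absorption_probabilities(W, y_l):
    """P(y_end = 1 | i) for the unlabeled nodes, obtained from relation (11.6)."""
    W = np.asarray(W, dtype=float)
    y_l = np.asarray(y_l, dtype=float)
    l = len(y_l)
    P = W / W.sum(axis=1, keepdims=True)   # assumed p_ij = W_ij / sum_k W_ik
    z_l = (y_l == 1).astype(float)         # delta_{y_i, 1} on labeled nodes
    # On unlabeled nodes: z_u = P_ul z_l + P_uu z_u
    n_u = W.shape[0] - l
    return np.linalg.solve(np.eye(n_u) - P[l:, l:], P[l:, :l] @ z_l)

def harmonic_labels(W, y_l):
    """Labels of unlabeled nodes when labeled ones are clamped to y_l."""
    W = np.asarray(W, dtype=float)
    y_l = np.asarray(y_l, dtype=float)
    l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W         # un-normalized Laplacian D - W
    return -np.linalg.solve(L[l:, l:], L[l:, :l] @ y_l)
```

With these definitions, `2 * absorption_probabilities(W, y_l) - 1` should coincide with `harmonic_labels(W, y_l)`, which is the shift and scaling mentioned in the text: thresholding the probability at 0.5 is the same as taking the sign of the label.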
On the other hand, consistency with the geometry of the data, which follows from
the smoothness (or manifold) assumption discussed in section 1.2, motivates a
penalty term of the form
$$\frac{1}{2}\sum_{i,j=1}^{n} W_{ij}(\hat{y}_i - \hat{y}_j)^2 = \frac{1}{2}\left(2\sum_{i=1}^{n}\hat{y}_i^2\sum_{j=1}^{n} W_{ij} - 2\sum_{i,j=1}^{n} W_{ij}\,\hat{y}_i\hat{y}_j\right) = \hat{Y}^\top (D - W)\hat{Y} = \hat{Y}^\top L\hat{Y} \tag{11.10}$$
with L = D − W the un-normalized graph Laplacian. This means we penalize rapid
changes in Ŷ between points that are close (as given by the similarity matrix W).
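The identity (11.10) is easy to verify numerically. The following sketch (function names are hypothetical, and W is assumed symmetric) computes the penalty both as a sum over pairs and as a quadratic form in the Laplacian.

```python
import numpy as np

def smoothness_penalty_pairwise(W, y):
    """(1/2) * sum_ij W_ij (y_i - y_j)^2, computed directly."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = y[:, None] - y[None, :]
    return 0.5 * np.sum(W * diff ** 2)

def smoothness_penalty_laplacian(W, y):
    """y^T L y with L = D - W and D_ii = sum_j W_ij (W assumed symmetric)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    return y @ L @ y
```

Both functions should return the same value for any symmetric W; the penalty is zero exactly when the labels are constant on each connected component of the graph.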
Various algorithms have been proposed based on such considerations. Zhu et al.
(2003b) force the labels on the labeled data (Ŷl = Yl ), then minimize (11.10) over
Ŷu . However, if there is noise in the available labels, it may be beneficial to allow
the algorithm to relabel the labeled data (this could also help generalization in a
noise-free setting where, for instance, a positive sample had been drawn from a
region of space mainly filled with negative samples). This observation leads to a
more general cost criterion involving a tradeoff between (11.9) and (11.10) (Belkin
et al., 2004b; Delalleau et al., 2005). A small regularization term can also be added
in order to prevent degenerate situations, for instance, when the graph g has a
connected component with no labeled sample. We thus obtain the following general
labeling cost (see footnote 2):
2. Belkin et al. (2004b) first center the vector Yl and also constrain Ŷ to be centered:
these restrictions are needed to obtain theoretical bounds on the generalization error, and
will not be discussed in this chapter.

Joachims (2003) obtained the same kind of cost criterion from the perspective of
spectral clustering. The unsupervised minimization of Ŷ⊤LŶ (under the constraints
Ŷ⊤1 = 0 and ‖Ŷ‖² = n) is a relaxation of the NP-hard problem of minimizing the
normalized cut of the graph g, i.e. splitting g into two subsets g+ = (V+, E+) and
g− = (V−, E−) such as to minimize
$$\frac{\sum_{i \in V^+,\, j \in V^-} W_{ij}}{|V^+|\,|V^-|},$$
where the normalization by |V+||V−| favors balanced splits. Based on this approach,
Joachims (2003) introduced an additional cost which corresponds to our
part ‖Ŷl − Yl‖² of the cost (11.11), in order to turn this unsupervised minimization
into a semi-supervised transductive algorithm (called spectral graph transducer).
Note, however, that although very similar, the solution obtained differs from the
straightforward minimization of (11.11) since
- the labels are not necessarily +1 and −1, but depend on the ratio of the number
of positive examples over the number of negative examples (this follows from the
normalized cut optimization);
- the constraint ‖Ŷ‖² = n used in the unsupervised setting remains, thus leading
to an eigenvalue problem instead of the direct quadratic minimization that will be
studied in the next section;
- the eigenspectrum of the graph Laplacian is normalized by replacing the ordered
Laplacian eigenvalues by a monotonically increasing function, in order to focus
on the ranking among the smallest cuts and abstract, for example, from different
magnitudes of edge weights.
Belkin and Niyogi (2003b) also proposed a semi-supervised algorithm based
on the same idea of graph regularization, but using a regularization criterion
different from the quadratic penalty term (11.10). It consists in taking advantage
of properties of the graph Laplacian L, which can be seen as an operator on
functions defined on nodes of the graph g. The graph Laplacian is closely related
to the Laplacian on the manifold, whose eigenfunctions provide a basis for the
Hilbert space of L2 functions on the manifold (Rosenberg, 1997). Eigenvalues
of the eigenfunctions provide a measure of their smoothness on the manifold
(low eigenvalues correspond to smoother functions, with the eigenvalue 0 being
associated with the constant function). Projecting any function in L2 on the
first p eigenfunctions (sorted by order of increasing eigenvalue) is thus a way of
smoothing it on the manifold. The same principle can be applied to our graph
setting, thus leading to algorithm 11.4 (Belkin and Niyogi, 2003b) below. It consists
in computing the first p eigenvectors of the graph Laplacian (each eigenvector
can be seen as the corresponding eigenfunction applied on training points), then
finding the linear combination of these eigenvectors that best predicts the labels
(in the mean-squared sense). The idea is to obtain a smooth function (in the sense
that it is a linear combination of the p smoothest eigenfunctions of the Laplacian
operator on the manifold) that fits the labeled data. This algorithm does not
explicitly correspond to the minimization of a nonparametric quadratic criterion
such as (11.11) and thus is not covered by the connection shown in section 11.3.3
with label propagation algorithms, but one must keep in mind that it is based
11.3 Quadratic Cost Criterion 201
In order to minimize the quadratic criterion (11.11), we can compute its derivative
with respect to Ŷ. We will denote by S the diagonal matrix (n × n) given by
$S_{ii} = I_{[l]}(i)$ (i.e., $S_{ii} = 1$ if i ≤ l and 0 otherwise), so that the first part of the
cost can be rewritten ‖SŶ − SY‖². The derivative of the criterion is then
$$\frac{1}{2}\frac{\partial C(\hat{Y})}{\partial \hat{Y}} = S(\hat{Y} - Y) + \mu L\hat{Y} + \mu\epsilon\hat{Y} = (S + \mu L + \mu\epsilon I)\,\hat{Y} - SY.$$
This shows how the new labels can be obtained by a simple matrix inversion. It
is interesting to note that this matrix does not depend on the original labels, but
only on the graph Laplacian L; the way labels are “propagated” to the rest of the
graph is entirely determined by the graph structure.
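Setting the derivative above to zero gives the linear system (S + μL + μεI)Ŷ = SY. A minimal sketch of the corresponding computation is given below; it assumes labeled nodes come first, and the function name and default values of `mu` and `eps` are hypothetical. It uses a linear solve rather than an explicit matrix inverse, which is the usual numerical practice.

```python
import numpy as np

def quadratic_criterion_labels(W, y_labeled, n_labeled, mu=1.0, eps=1e-6):
    """Solve (S + mu * L + mu * eps * I) y_hat = S y for y_hat."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    s = np.zeros(n)
    s[:n_labeled] = 1.0                      # S_ii = 1 for labeled nodes, else 0
    S = np.diag(s)
    L = np.diag(W.sum(axis=1)) - W           # un-normalized Laplacian D - W
    A = S + mu * L + mu * eps * np.eye(n)    # depends on the graph, not on labels
    y = np.zeros(n)
    y[:n_labeled] = y_labeled                # unlabeled entries are ignored by S
    return np.linalg.solve(A, S @ y)
```

Since the matrix `A` does not depend on the labels, it can be factorized once and reused for several label vectors (for instance, one per class in a one-versus-rest setting).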
An alternative (and very similar) criterion was proposed by Zhou et al. (2004),
Mx = b (11.14)
i.e. exactly the update equations (11.2) and (11.3) used in algorithm 11.2. Convergence
of this iterative algorithm is guaranteed by the following theorem (Saad,
1996): if the matrix M is strictly diagonally dominant, the Jacobi iteration (11.15)
converges to the solution of the linear system (11.14). A matrix M is strictly
diagonally dominant iff $|M_{ii}| > \sum_{j \neq i} |M_{ij}|$ for all i, which is clearly the case for the matrix
S + μL + μεI (remember L = D − W with $D_{ii} = \sum_{j \neq i} W_{ij}$, and all $W_{ij} \geq 0$). Note
that this condition also guarantees the convergence of the Gauss-Seidel iteration,
which is the same as the Jacobi iteration except that updated coordinates $x_i^{(t+1)}$
are used in the computation of $x_j^{(t+1)}$ for j > i. This means we can apply Eqs. 11.2
and 11.3 with Ŷ(t+1) and Ŷ(t) sharing the same storage.
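The two iterations can be sketched generically for a system Mx = b. The code below is an illustrative sketch under the assumption that M is strictly diagonally dominant; the function names and the fixed iteration count are hypothetical choices, and a practical implementation would rather stop on a convergence test.

```python
import numpy as np

def jacobi(M, b, n_iter=1000):
    """Jacobi iteration: x_i <- (b_i - sum_{j != i} M_ij x_j) / M_ii."""
    M = np.asarray(M, dtype=float)
    b = np.asarray(b, dtype=float)
    diag = np.diag(M)
    off = M - np.diag(diag)          # off-diagonal part of M
    x = np.zeros_like(b)
    for _ in range(n_iter):
        x = (b - off @ x) / diag     # all coordinates use the previous iterate
    return x

def gauss_seidel(M, b, n_iter=1000):
    """Gauss-Seidel iteration: same update, performed in place."""
    M = np.asarray(M, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    for _ in range(n_iter):
        for i in range(len(b)):
            # x[j] for j < i has already been updated during this sweep
            x[i] = (b[i] - M[i] @ x + M[i, i] * x[i]) / M[i, i]
    return x
```

With M = S + μL + μεI and b = SY, both functions should agree with a direct solve; the in-place variant is the one that lets the old and new label vectors share storage.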
To show the equivalence between algorithm 11.3 and the minimization of C ′ given
in (11.13), we compute its derivative with respect to Ŷ :
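$$\frac{1}{2}\frac{\partial C'(\hat{Y})}{\partial \hat{Y}} = \hat{Y} - SY + \mu\left(\hat{Y} - \mathcal{L}\hat{Y}\right),$$
where $\mathcal{L} = D^{-1/2} W D^{-1/2}$ is the normalized weight matrix of algorithm 11.3. This derivative
is zero iff
$$\hat{Y} = \left((1 + \mu)I - \mu\mathcal{L}\right)^{-1} SY,$$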
which is the same equation as (11.4) with μ = α/(1 − α), up to a positive factor
(which has no effect on the classification since we use only the sign).
It is interesting to study the limit case when μ → 0. In this section we will set ε = 0
to simplify notations, but one should keep in mind that it is usually better to use a
small positive value for regularization. When μ → 0, the cost (11.11) is dominated
by ‖Ŷl − Yl‖². Intuitively, this corresponds to
1. forcing Ŷl = Yl, then
2. minimizing Ŷ⊤LŶ.
Writing Ŷ = (Yl, Ŷu) (i.e. Ŷl = Yl) and
$$L = \begin{pmatrix} L_{ll} & L_{lu} \\ L_{ul} & L_{uu} \end{pmatrix}$$
If we consider now Eq. 11.12 where Ŷl is not constrained anymore, when ε = 0 and
μ → 0, using the continuity of the inverse matrix application at I, we obtain that
Ŷl → Yl and
$$\hat{Y}_u = -L_{uu}^{-1} L_{ul}\,\hat{Y}_l,$$
to the linear system (11.8). It is immediately seen that this system is exactly the
same as the one obtained in (11.16). The equivalence of the solutions discussed in
the previous section between the linear system and iterative algorithms thus shows
that the random walk algorithm described in section 11.2.2 is equivalent to the
iterative algorithm 11.2 when μ → 0, i.e., when we keep the original labels instead
of iteratively updating them by (11.2).
Analogy with Electric Networks Zhu et al. (2003b) also link this solution to
heat kernels and give an electric network interpretation taken from Doyle and Snell
(1984), which we now present. This analogy is interesting as it gives a physical
interpretation to the optimization and label propagation framework studied in this
chapter. Let us consider an electric network built from the graph g by adding
resistors with conductance Wij between nodes i and j (the conductance is the
inverse of the resistance). The positive labeled nodes are connected to a positive
voltage source (+1V ), the negative ones to a negative source (−1V ), and we want to
compute the voltage on the unlabeled nodes (i.e., their label). Denoting the current
(intensity) between i and j by $I_{ij}$, and the voltage by $V_{ij} = \hat{y}_j - \hat{y}_i$, we use Ohm's law,
$$I_{ij} = W_{ij} V_{ij}, \tag{11.17}$$
and Kirchhoff's law on each unlabeled node i > l,
$$\sum_j I_{ij} = 0. \tag{11.18}$$
Kirchhoff's law states that the sum of currents flowing out from i (such that $I_{ij} > 0$)
is equal to the sum of currents flowing into i ($I_{ij} < 0$). Here, it is only useful to
apply it to unlabeled nodes as the labeled ones are connected to a voltage source,
and thus receive some unknown (and uninteresting) current. Using (11.17), we can
rewrite (11.18),
$$0 = \sum_j W_{ij}(\hat{y}_j - \hat{y}_i) = \sum_j W_{ij}\hat{y}_j - \hat{y}_i\sum_j W_{ij} = (W\hat{Y} - D\hat{Y})_i = -(L\hat{Y})_i,$$
and since this is true for all i > l, it is equivalent in matrix notations to
$$L_{ul}\hat{Y}_l + L_{uu}\hat{Y}_u = 0,$$
which is exactly (11.16). Thus the solution of the limit case (when labeled examples
are forced to keep their given label) is given by the voltage in an electric network
where labeled nodes are connected to voltage sources and resistors correspond to
weights in the graph g.
11.4 From Transduction to Induction 205
The previous algorithms all follow the transduction setting presented in section
1.2.4. However, it could happen that one needs an inductive algorithm, for instance,
in a situation where new test examples are presented one at a time and solving the
linear system turns out to be too expensive. In such a case, the cost criterion
(11.11) naturally leads to an induction formula that can be computed in O(n)
time. Assuming that labels ŷ1, . . . , ŷn have already been computed by one of the
algorithms above, and we want the label ŷ of a new point x: we can minimize
C(ŷ1, . . . , ŷn, ŷ) only with respect to this new label ŷ, i.e. minimize
$$\text{constant} + \mu\left(\sum_j W_X(x, x_j)(\hat{y} - \hat{y}_j)^2 + \epsilon\hat{y}^2\right),$$
From the beginning of the chapter, we have assumed that the class label is given by
the sign of ŷ. Such a rule works well when classes are well separated and balanced.
However, if this is not the case (which is likely to happen with real-world data
sets), the classification resulting from the label propagation algorithms studied in
this chapter may not reflect the prior class distribution.
A way to solve this problem is to perform class mass normalization (Zhu et al.,
2003b), i.e. to rescale classes so that their respective weights over unlabeled ex-
amples match the prior class distribution (estimated from labeled examples). Until
now, we had been using a scalar label ŷi ∈ [−1, 1], which is handy in the binary
case. In this section, for the sake of clarity, we will use an M -dimensional vector (M
being the number of classes), with each element ŷi,k between 0 and 1 giving a score
(or weight) for class k (see also footnote 1 at the beginning of this chapter). For
instance, in the binary case, a scalar ŷi ∈ [−1, 1] would be represented by the vector
$\left(\frac{1}{2}(1 + \hat{y}_i), \frac{1}{2}(1 - \hat{y}_i)\right)^\top$, where the second element would be the score for class −1.
Class mass normalization works as follows. Let us denote by pk the prior probability
of class k obtained from the labeled examples, i.e.,
$$p_k = \frac{1}{l}\sum_{i=1}^{l} y_{i,k}.$$
The mass of class k as given by our algorithm will be the average of estimated
weights of class k over unlabeled examples, i.e.,
$$m_k = \frac{1}{u}\sum_{i=l+1}^{n} \hat{y}_{i,k}.$$
In general, such a scaling gives a better classification performance when there are
enough labeled data to accurately estimate the class distribution, and when the
unlabeled data come from the same distribution. Note also that if there is an m
such that each class mass is mk = mpk , i.e., the masses already reflect the prior
class distribution, then the class mass normalization step has no effect, as wk = m−1
for all k.
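Class mass normalization can be sketched as follows. The scaling factor is taken here to be w_k = p_k / m_k, and each unlabeled point is assigned to the class with the largest rescaled score; both choices are assumptions of this sketch (they are consistent with the remark that w_k = 1/m when m_k = m p_k), and the function name is hypothetical.

```python
import numpy as np

def class_mass_normalize(Y_l, Y_hat_u):
    """Rescale class scores so that class masses match the labeled class prior.

    Y_l:      (l, M) one-hot labels of the labeled examples.
    Y_hat_u:  (u, M) estimated class scores of the unlabeled examples.
    Returns the predicted class indices and the per-class factors w_k.
    """
    Y_l = np.asarray(Y_l, dtype=float)
    Y_hat_u = np.asarray(Y_hat_u, dtype=float)
    p = Y_l.mean(axis=0)        # p_k: class prior estimated from labeled data
    m = Y_hat_u.mean(axis=0)    # m_k: mass of class k over unlabeled data
    w = p / m                   # assumed scaling w_k = p_k / m_k
    return np.argmax(Y_hat_u * w, axis=1), w
```

For example, with a balanced prior and scores that systematically favor the first class, the rescaling can move borderline points to the second class, which a plain argmax would never select.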
As introduced in section 1.2, the smoothness assumption (or its semi-supervised
variant) about the underlying target function y(·) (such that y(xi) = yi) is at
the core of most of the algorithms studied in this book, along with the cluster
assumption (or its variant, the low-density separation assumption). The former
implies that if x1 is near x2 , then y1 is expected to be near y2 , and the latter implies
that the data density is low near the decision surface. The smoothness assumption is
intimately linked to a definition of what it means for x1 to be near x2 , and that can
be embodied in a similarity function on input space, WX (·, ·), which is at the core
of the graph-based algorithms reviewed in this chapter, transductive support vector
machines (SVMs) (where WX is seen as a kernel), and semi-supervised Gaussian
processes (where WX is seen as the covariance of a prior over functions), both in
part II of this book, as well as the algorithms based on a first unsupervised step to
learn a better representation (part IV).
The central claim of this section is that in order to obtain good results with algo-
rithms that rely solely on the smoothness assumption and on the cluster assumption
(or the low-density separation assumption), an acceptable decision surface (in the
sense that its error is at an acceptable level) must be “smooth” enough. This can
happen if the data for each class lie near a low-dimensional manifold (the manifold
assumption) and these manifolds are smooth enough, i.e., do not have high
curvature where it matters, namely where a wrong characterization of the manifold
would lead to a large error rate. This claim is intimately linked to the well-known
curse of dimensionality, so we start the section by reviewing results on generalization
error for classical nonparametric learning algorithms as dimension increases.
We present theoretical arguments that suggest notions of locality of the learning al-
gorithm that make it sensitive to the dimension of the manifold near which data lie.
These arguments are not trivial extensions of the arguments for classical nonpara-
metric algorithms, because the semi-supervised algorithms such as those studied
in this book involve expansion coefficients (e.g., the ŷj in equation (11.19)) that
are nonlocal, i.e., the coefficient associated with the jth example xj may depend
on inputs xi that are far from xj , in the sense of the similarity function or kernel
WX (xi , xj ). For instance, a labeled point xi far from an unlabeled point xj (i.e.
WX (xi , xj ) is small) may still influence the estimated label of xj if there exists a
path in the neighborhood graph g that connects xi to xj (going through unlabeled
examples).
In the last section (11.6.5), we will try to argue that it is possible to build
nonlocal learning algorithms, while not using very specific priors about the task
to be learned. This goes against common folklore that when there are not enough
training examples in a given region, one cannot generalize properly in that region.
This would suggest that difficult learning problems such as those encountered in
artificial intelligence (e.g., vision, language, robotics, etc.) would benefit from the
development of a larger array of such nonlocal learning algorithms.
where i runs over all the examples (labeled and unlabeled), and kX (·, ·) is a
symmetric function (kernel) that is either chosen a priori or using the whole data
set X (and does not need to be positive semi-definite). The learning algorithm is
then allowed to choose the scalars b and αi .
Most of the decision functions learned by the algorithms discussed in this chapter
can be written as in (11.20). In particular, the label propagation algorithm 11.2
leads to the induction formula (11.19) corresponding to
$$b = 0, \qquad \alpha_i = \hat{y}_i, \qquad k_X(x, x_i) = \frac{W_X(x, x_i)}{\epsilon + \sum_j W_X(x, x_j)}. \tag{11.21}$$
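Written as a kernel machine with the choices in (11.21), the induced label of a new point is a weighted average of the already computed labels. The sketch below assumes a Gaussian similarity W_X with width `sigma`; the function name and default values are hypothetical.

```python
import numpy as np

def induce_label(x, X, y_hat, sigma=1.0, eps=1e-6):
    """f(x) = sum_i y_hat_i * W_X(x, x_i) / (eps + sum_j W_X(x, x_j))."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(X, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Assumed Gaussian similarity between x and each training point
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    k = w / (eps + w.sum())      # k_X(x, x_i) as in (11.21)
    return float(k @ y_hat)      # b = 0 and alpha_i = y_hat_i
```

Each call costs O(n) once the labels `y_hat` are available, and the returned value minimizes Σ_j W_X(x, x_j)(ŷ − ŷ_j)² + εŷ² over ŷ.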
The Laplacian regularization algorithm (algorithm 11.4) from Belkin and Niyogi
(2003b), which first learns about the shape of the manifold with an embedding based
on the principal eigenfunctions of the Laplacian of the neighborhood, also falls into
this category. As shown by Bengio et al. (2004a), the principal eigenfunctions can
be estimated by the Nyström formula:
$$f_k(x) = \frac{\sqrt{n}}{\lambda_k}\sum_{i=1}^{n} v_{k,i}\,k_X(x, x_i), \tag{11.22}$$
where (λk , vk ) is the kth principal (eigenvalue, eigenvector) pair of the Gram matrix
K obtained by Kij = kX (xi , xj ), and where kX (·, ·) is a data-dependent equivalent
kernel derived from the Laplacian of the neighborhood graph g. Since the resulting
decision function is a linear combination of these eigenfunctions, we obtain again a
kernel machine (11.20).
In the following, we say that a kernel function kX (·, ·) is local if for all x ∈ X,
there exists a neighborhood N(x) ⊂ X such that
$$f(x) \simeq b + \sum_{x_i \in N(x)} \alpha_i\,k_X(x, x_i). \tag{11.23}$$
Intuitively, this means that only the near neighbors of x have a significant contribu-
tion to f (x). For instance, if kX is the Gaussian kernel, N(x) is defined as the points
in X that are close to x with respect to σ (the width of the kernel). If (11.23) is an
equality, we say that kX is strictly local. An example is when WX is the k-nearest
neighbor kernel in algorithm 11.2. kX obtained by (11.21) is then also the k-nearest
neighbor kernel, and we have N(x) = Nk (x) the set of the k nearest neighbors of
11.6 Curse of Dimensionality for Semi-Supervised Learning 209
x, so that
$$f(x) = \sum_{x_i \in N_k(x)} \frac{\hat{y}_i}{k}.$$
Similarly, we say that kX is local-derivative if there exists another kernel k̃X such
that for all x ∈ X, there exists a neighborhood N(x) ⊂ X such that
$$\frac{\partial f}{\partial x}(x) \simeq \sum_{x_i \in N(x)} \alpha_i\,(x - x_i)\,\tilde{k}_X(x, x_i). \tag{11.24}$$
Because here k̃X is proportional to a Gaussian kernel with width σ, the neighbor-
hood N(x) is also defined as the points in X which are close to x with respect
to σ. Again, we say that kX is strictly local-derivative when (11.24) is an equality
(for instance, when kX is a thresholded Gaussian kernel, i.e. kX(x, xi) = 0 when
‖x − xi‖ > δ).
The term curse of dimensionality was coined by Bellman (1961) in the
context of control problems, but it has been used rightfully to describe the poor
generalization performance of local nonparametric estimators as the dimensionality
increases. We define bias as the square of the expected difference between the
estimator and the true target function, and we refer generically to variance as
the variance of the estimator, in both cases the expectations being taken with
respect to the training set as a random variable. It is well known that classical
nonparametric estimators must trade bias and variance of the estimator through
a smoothness hyperparameter, e.g., kernel bandwidth σ for the Nadaraya-Watson
estimator (Gaussian kernel). As σ increases, bias increases and the predictor
becomes less local, but variance decreases, hence the bias-variance dilemma (Geman
et al., 1992) is also about the locality of the estimator.
A nice property of classical nonparametric estimators is that one can prove their
convergence to the target function as n → ∞, i.e., these are consistent estimators.
One obtains consistency by appropriately varying the hyperparameter that controls
the locality of the estimator as n increases. Basically, the kernel should be allowed
to become more and more local, so that bias goes to zero, but the “effective number
of examples” involved in the estimator at x,
$$\frac{1}{\sum_{i=1}^{n} k_X(x, x_i)^2},$$
(equal to k for the k-nearest neighbor estimator, with kX(x, xi) = 1/k for xi a
neighbor of x) should increase as n increases, so that variance is also driven to 0.
For example, one obtains this condition with $\lim_{n\to\infty} k = \infty$ and $\lim_{n\to\infty} k/n = 0$
for the k-nearest neighbor. Clearly the first condition is sufficient for variance to
go to 0 and the second for the bias to go to 0 (since k/n is proportional to the
volume around x containing the k-nearest neighbors). Similarly, for the Nadaraya-Watson
estimator with bandwidth σ, consistency is obtained if $\lim_{n\to\infty} \sigma = 0$
and $\lim_{n\to\infty} n\sigma = \infty$ (in addition to regularity conditions on the kernel). See
the book by Härdle et al. (2004) for a recent and easily accessible exposition
(with web version). The bias is due to smoothing the target function over the
volume covered by the effective neighbors. As the intrinsic dimensionality of the
data increases (the number of dimensions that they actually span locally), bias
increases. Since that volume increases exponentially with dimension, the effect of
the bias quickly becomes very severe. To see this, consider the classical example of
the $[0, 1]^d$ hypercube in $\mathbb{R}^d$ with uniformly distributed data in the hypercube. To
hold a fraction p of the data in a subcube of it, that subcube must have sides of
length $p^{1/d}$. As d → ∞, $p^{1/d} \to 1$, i.e., we are averaging over distances that cover
almost the whole span of the data, just to keep variance constant (by keeping the
effective number of neighbors constant).
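The hypercube example is easy to reproduce numerically. The helper below is a hypothetical illustration that simply evaluates p^{1/d} for a few dimensions.

```python
def subcube_side(p, d):
    """Side length of a subcube of [0, 1]^d holding a fraction p of uniform data."""
    return p ** (1.0 / d)

# Side needed to capture 1% of the data as the dimension grows
for d in (1, 2, 10, 100):
    print(d, round(subcube_side(0.01, d), 3))
```

To capture only 1% of the data, the side length is 0.1 in two dimensions, about 0.63 in ten dimensions, and above 0.95 in one hundred dimensions, so the "local" neighborhood spans almost the whole range of each coordinate.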
For a wide class of kernel estimators with kernel bandwidth σ, the expected
generalization error (bias plus variance, ignoring the noise) can be written as follows
(Härdle et al., 2004):
$$\text{expected error} = \frac{C_1}{n\sigma^d} + C_2\sigma^4,$$
with C1 and C2 not depending on n or d. Hence an optimal bandwidth is
chosen proportional to $n^{-1/(4+d)}$, and the resulting generalization error converges
in $n^{-4/(4+d)}$, which becomes very slow for large d. Consider for example the increase
in number of examples required to get the same level of error, in one dimension
versus d dimensions. If n1 is the number of examples required to get a level of error
e, to get the same level of error in d dimensions requires on the order of $n_1^{(4+d)/5}$
examples, i.e. the required number of examples is exponential in d. However, if the
data distribution is concentrated on a lower-dimensional manifold, it is the manifold
dimension that matters. Indeed, for data on a smooth lower-dimensional manifold,
the only dimension that, for instance, a k-nearest neighbor classifier sees is the
dimension of the manifold, since it only uses the Euclidean distances between the
near neighbors, and if they lie on such a manifold then the local Euclidean distances
approach the local geodesic distances on the manifold (Tenenbaum et al., 2000).
The curse of dimensionality on a manifold (acting with respect to the dimensionality
Let us first consider how semi-supervised learning algorithms could learn about
the shape of the manifolds near which the data concentrate, and how either a
high-dimensional manifold or a highly curved manifold could prevent this when the
algorithms are local, in the local-derivative sense discussed above. As a prototypical
example, let us consider the algorithm proposed by Belkin and Niyogi (2003b)
(algorithm 11.4). The embedding coordinates are given by the eigenfunctions $f_k$
from (11.22).
The first derivative of fk with respect to x represents the tangent vector of the
kth embedding coordinate. Indeed, it is the direction of variation of x that gives
rise locally to the maximal increase in the kth coordinate. Hence the set of manifold
tangent vectors $\left\{\frac{\partial f_1(x)}{\partial x}, \frac{\partial f_2(x)}{\partial x}, \ldots, \frac{\partial f_d(x)}{\partial x}\right\}$ spans the estimated tangent plane of the
manifold.
By the local-derivative property (strict or not), each of the tangent vectors at x
is constrained to be exactly or approximately in the span of the difference vectors
x − xi , where xi is a neighbor of x. Hence the tangent plane is constrained to be a
subspace of the span of the vectors x − xi , with xi neighbors of x. This is illustrated
in figure 11.2. In addition to the algorithm of Belkin and Niyogi (2003b), a number
Figure 11.2 Geometric illustration of the effect of the local derivative property shared
by semi-supervised graph-based algorithms and spectral manifold learning algorithms.
The tangent plane at x is implicitly estimated, and is constrained to be in the span of the
vectors (xi − x), with xi near neighbors of x. When the number of neighbors is small the
estimation of the manifold shape has high variance, but when it is large, the estimation
would have high bias unless the true manifold is very flat.
In this section we focus on algorithms of the type described in part III of the book
(graph-based algorithms), using the notation and the induction formula presented
in this chapter (on label propagation and a quadratic criterion unifying many of
these algorithms).
We consider here that the ultimate objective is to learn a decision surface, i.e.,
we have a classification problem, and therefore the region of interest in terms of
theoretical analysis is mostly the region near the decision surface. For example, if
we do not characterize the manifold structure of the underlying distribution in a
region far from the decision surface, it is not important, as long as we get it right
near the decision surface. Whereas in the previous section we built an argument
based on capturing the shape of the manifold associated with each class, here we
focus directly on the discriminant function and on learning the shape of the decision
surface.
An intuitive view of label propagation suggests that a region of the manifold
around a labeled (e.g. positive) example will be entirely labeled positively, as
the example spreads its influence by propagation on the graph representing the
underlying manifold. Thus, the number of regions with constant label should be on
the same order as (or less than) the number of labeled examples. This is easy to see
in the case of a sparse weight matrix W, i.e. when the affinity function is strictly
local. We define a region with constant label as a connected subset of the graph
g where all nodes xi have the same estimated label (sign of ŷi), and such that no
other node can be added while keeping these properties. The following proposition
then holds (note that it is also true, but trivial, when W defines a fully connected
graph, i.e. N(x) = X for all x).
Proof By contradiction, if this proposition is false, then there exists a region with
constant estimated label that does not contain any labeled example. Without loss
of generality, consider the case of a positive constant label, with xl+1 , . . . , xl+q the
q samples in this region. The part of the cost (11.11) depending on their labels is
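$$\begin{aligned}
C(\hat{y}_{l+1}, \ldots, \hat{y}_{l+q}) = {} & \frac{\mu}{2}\sum_{i,j=l+1}^{l+q} W_{ij}(\hat{y}_i - \hat{y}_j)^2 \\
& + \mu\sum_{i=l+1}^{l+q}\left(\sum_{j \notin \{l+1,\ldots,l+q\}} W_{ij}(\hat{y}_i - \hat{y}_j)^2\right) \\
& + \mu\epsilon\sum_{i=l+1}^{l+q}\hat{y}_i^2.
\end{aligned}$$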
The second term is strictly positive, and because the region we consider is maximal
(by definition) all samples xj outside of the region such that Wij > 0 satisfy
ŷj < 0 (for xi a sample in the region). Since all ŷi are strictly positive for
i ∈ {l + 1, . . . , l + q}, this means this second term can be strictly decreased by
setting all ŷi to 0 for i ∈ {l + 1, . . . , l + q}. This also sets the first and third terms to
zero (i.e. their minimum), showing that the set of labels ŷi are not optimal, which
is in contradiction with their definition as the labels that minimize C.
This means that if the class distributions are such that there are many distinct
regions with constant labels (either separated by low-density regions or regions with
samples from the other class), we will need at least the same number of labeled
samples as there are such regions (assuming we are using a strictly local kernel
such as the k-nearest neighbor kernel, or a thresholded Gaussian kernel). But this
number could grow exponentially with the dimension of the manifold(s) on which
the data lie, for instance in the case of a labeling function varying highly along each
dimension, even if the label variations are “simple” in a nonlocal sense, e.g. if they
alternate in a regular fashion.
When the affinity matrix W is not sparse (e.g., Gaussian kernel), obtaining
such a result is less obvious. However, for local kernels, there often exists a sparse
approximation of W (for instance, in the case of a Gaussian kernel, one can set to
0 entries below a given threshold or that do not correspond to a k-nearest neighbor
relationship). Thus we conjecture that the same kind of result holds for such dense
weight matrices obtained from a local kernel.
Another indication that highly varying functions are fundamentally hard to learn
with graph-based semi-supervised learning algorithms is given by the following
theorem (Bengio et al., 2006a):
Theorem 11.2 Suppose that the learning problem is such that, in order to achieve a
given error level for samples from a distribution P with a Gaussian kernel machine
(11.20), f must change sign at least 2k times along some straight line (i.e.,
in the case of a classifier, the decision surface must be crossed at least 2k times by
that straight line). Then the kernel machine must have at least k examples (labeled
or unlabeled).
The theorem is proven for the case where $k_X$ is the Gaussian kernel, but we
conjecture that the same result applies to other local kernels, such as the normalized
Gaussian or the k-nearest neighbor kernels implicitly used in graph-based
semi-supervised learning algorithms. It is consistent with proposition 11.1, since both tell
us that we need at least k examples to represent k “variations” in the underlying
target classifier, whether along a straight line or as the number of regions of differing
class on a manifold.
What conclusions should we draw from the previous results? They should help
to better circumscribe where the current local semi-supervised learning algorithms
are likely to be most effective, and they should also help to suggest directions of
research into nonlocal learning algorithms, either using nonlocal kernels or similarity
functions, or using altogether other principles of generalization.
When applying a local semi-supervised learning algorithm to a new task, one
should consider the plausibility of the hypothesis of a low-dimensional manifold
near which the distribution concentrates. For some problems this could be very
reasonable a priori (e.g., printed digit images varying mostly due to a few geometric
and optical effects). For others, however, one would expect tens or hundreds of
degrees of freedom (e.g., many artificial intelligence problems, such as natural
language processing or recognition of complex composite objects).
Concerning new directions of research suggested by these results, several possible
approaches can already be mentioned:
Semi-supervised algorithms that are not based on the neighborhood graph, such
as the one presented in chapter 9, in which a discriminant training criterion for
supervised learning is adapted to semi-supervised learning by taking advantage of
the cluster hypothesis, more precisely, the low-density separation hypothesis (see
section 1.2).
Algorithms based on the neighborhood graph but in which the kernel or similarity
function (a) is nonisotropic or (b) is adapted based on the data (with the spread in
different directions being adapted). In that case the predictor will be neither local
nor local-derivative. More generally, the structure of the similarity function at x
should be inferred based not just on the training data in the close neighborhood of
x. For an example of such nonlocal learning in the unsupervised setting, see (Bengio
and Monperrus, 2005; Bengio et al., 2006b).
Other data-dependent kernels could be investigated, but one should check whether
the adaptation allows nonlocal learning, i.e., that information at x could be used
to usefully alter the prediction at a point x′ far from x.
More generally, algorithms that learn a similarity function Sim(x, y) in a nonlocal
way (i.e., taking advantage of examples far from x and y) should be good candidates
to consider to defeat the curse of dimensionality.
11.7 Discussion
the quadratic criterion helps to understand what these algorithms really do. The
solution can also be linked to physical phenomena such as voltage in an electric
network built from the graph, which provides other ways to reason about this
problem. In addition, the optimization framework leads to a natural extension of
the inductive setting that is closely related to other classical nonparametric learning
algorithms such as k-nearest neighbor or Parzen windows. Induction will be studied
in more depth in the next chapter, and the induction formula (11.19) will turn out to
be the basis for a subset approximation algorithm presented in chapter 18. Finally,
we have shown that the local semi-supervised learning algorithms are likely to be
limited to learning smooth functions for data living near low-dimensional manifolds.
Our approach of locality properties suggests a way to check whether new semi-
supervised learning algorithms have a chance to scale to higher-dimensional tasks
or learning less smooth functions, and motivates further investigation in nonlocal
learning algorithms.
Acknowledgments
The authors would like to thank the editors and anonymous reviewers for their
helpful comments and suggestions. This chapter has also greatly benefited from
advice from Mikhail Belkin, Dengyong Zhou, and Xiaojin Zhu, whose papers
first motivated this research (Belkin and Niyogi, 2003b; Zhou et al., 2004; Zhu
et al., 2003b). The authors also thank the following funding organizations for their
financial support: Canada Research Chair, NSERC, and MITACS.
12 The Geometric Basis of Semi-Supervised
Learning
12.1 Introduction
Consider now the left panel in figure 12.2. In the absence of unlabeled data the
black dot (marked “?”) is likely to be classified as blue (marked “−”). The unlabeled
data, however, makes classifying it as red (marked “+”) seem much more reasonable.
A third example is shown in figure 12.3. In the left panel, the unlabeled point
may be classified as blue (−) to agree with its nearest neighbor. However, unlabeled
data shown as gray clusters in the right panel change our belief.
These examples show how the geometry of unlabeled data may radically change
our intuition about classifier boundaries. We seek to translate these intuitions into
a framework for learning from labeled and unlabeled examples.
Recall now the standard setting of learning from examples. Given a pattern space
X, there is a probability distribution P on X × R according to which examples are
generated for function learning. Labeled examples are (x, y) pairs drawn according
to P. Unlabeled examples are simply x ∈ X sampled according to the marginal
distribution PX of P.
As we have seen, the knowledge of the marginal PX can be exploited for better
function learning (e.g., in classification or regression tasks). On the other hand,
if there is no identifiable relation between PX and the conditional P(y|x), the
knowledge of PX is unlikely to be of use.
Two possible connections between PX and P(y|x) can be stated as the following
important assumptions (also see the tutorial introduction in chapter 1 for related
discussion):
where V is some loss function, such as squared loss $(y_i - f(x_i))^2$ for regularized
least squares (RLS) or the soft margin loss function $\max[0, 1 - y_i f(x_i)]$ for SVM.
Penalizing the RKHS norm imposes smoothness conditions on possible solutions.
The classical representer theorem states that the solution to this minimization
problem exists in $\mathcal{H}_K$ and can be written as
$$
f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x). \tag{12.2}
$$
Therefore, the problem is reduced to optimizing over the finite dimensional space of
coefficients αi , which is the algorithmic basis for SVM, RLS, and other regression
and classification schemes.
We first consider the case when the marginal distribution is already known.
We note that the Laplace operator, as well as any differential operator, will
satisfy the boundedness condition, assuming that the kernel is sufficiently
differentiable.
The representer theorem above allows us to express the solution f ∗ directly in
terms of the labeled data, the (ambient) kernel K, and the marginal PX . If PX
is unknown, we see that the solution may be expressed in terms of an empirical
estimate of PX . Depending on the nature of this estimate, different approximations
to the solution may be developed. In the next section, we consider a particular
approximation scheme that leads to a simple algorithmic framework for learning
from labeled and unlabeled data.
$$
{} = \operatorname*{argmin}_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \mathbf{f}^T L \mathbf{f}, \tag{12.5}
$$
where $W_{ij}$ are edge weights in the data adjacency graph, $\mathbf{f} = [f(x_1), \ldots, f(x_{l+u})]^T$,
and L is the graph Laplacian given by L = D − W. Here, the diagonal matrix D
is given by $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$. The normalizing coefficient $\frac{1}{(u+l)^2}$ is the natural scale
factor for the empirical estimate of the Laplace operator (on a sparse adjacency
graph, one may normalize by $\sum_{i,j=1}^{l+u} W_{ij}$ instead). The following version of the
representer theorem shows that the minimizer has an expansion in terms of both
labeled and unlabeled examples and is a key to our algorithms.
12.2 Incorporating Geometry in Regularization 223
12.3 Algorithms
$$
\alpha^* = \left( JK + \gamma_A l I + \frac{\gamma_I l}{(u+l)^2} LK \right)^{-1} Y. \tag{12.7}
$$
Here, K is the (l + u) × (l + u) Gram matrix over labeled and unlabeled points;
Y is an (l + u) dimensional label vector given by Y = [y1 , . . . , yl , 0, . . . , 0]; and J is
an (l + u) × (l + u) diagonal matrix given by J = diag(1, . . . , 1, 0, . . . , 0) with the
first l diagonal entries as 1 and the rest 0.
Note that when γI = 0, (12.7) gives zero coefficients over unlabeled data. The
coefficients over labeled data are exactly those for standard RLS.
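As a minimal sketch of how (12.7) might be evaluated, the NumPy snippet below builds K, J, Y, and L for a small synthetic data set and solves the linear system. The Gaussian kernel, the dense Gaussian edge weights, the bandwidth `sigma`, the toy data, and the helper names `gaussian_gram`, `laplacian_rls`, and `predict` are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gaussian Gram matrix over the rows of X (kernel choice is an assumption)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def laplacian_rls(X, y_labeled, gamma_A, gamma_I, sigma=1.0):
    """Expansion coefficients alpha* of (12.7); labeled rows of X come first."""
    n = X.shape[0]                      # n = l + u
    l = len(y_labeled)
    K = gaussian_gram(X, sigma)
    W = K - np.diag(np.diag(K))         # assumed edge weights: Gaussian, no self-loops
    L = np.diag(W.sum(axis=1)) - W      # graph Laplacian L = D - W
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])
    Y = np.r_[y_labeled, np.zeros(n - l)]
    M = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(M, Y), K

def predict(alpha, X_train, x_new, sigma=1.0):
    """Out-of-sample value f(x) = sum_i alpha_i K(x_i, x)."""
    k = np.exp(-((X_train - x_new) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    return float(alpha @ k)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))            # 4 labeled points followed by 8 unlabeled
y = np.array([1.0, -1.0, 1.0, -1.0])

alpha_semi, K = laplacian_rls(X, y, gamma_A=0.1, gamma_I=1.0)
alpha_sup, _ = laplacian_rls(X, y, gamma_A=0.1, gamma_I=0.0)
```

With `gamma_I=0.0` the coefficients over the unlabeled points vanish and the labeled coefficients coincide with those of standard RLS, in line with the note on (12.7) above.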
$$
\beta^\star = \max_{\beta \in \mathbb{R}^l} \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^T Q \beta \tag{12.8}
$$
subject to the constraints $\sum_{i=1}^{l} y_i \beta_i = 0$, $0 \le \beta_i \le \frac{1}{l}$, $i = 1, \ldots, l$, where
$$
Q = Y J K \left( 2\gamma_A I + 2 \frac{\gamma_I}{(u+l)^2} LK \right)^{-1} J^T Y. \tag{12.9}
$$
Here, Y is the diagonal matrix $Y_{ii} = y_i$, K is the Gram matrix over both the labeled
and the unlabeled data; L is the data adjacency graph Laplacian; J is an l × (l + u)
matrix given by $J_{ij} = 1$ if i = j and $x_i$ is a labeled example, and $J_{ij} = 0$ otherwise. To
obtain the optimal expansion coefficient vector $\alpha^* \in \mathbb{R}^{l+u}$, one has to solve the
following linear system after solving the quadratic program above:
Table 12.1
Laplacian SVM/RLS
Input: l labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$, u unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$
Output: Estimated function f : Rn → R
$$
\alpha^* = \left( 2\gamma_A I + 2 \frac{\gamma_I}{(u+l)^2} LK \right)^{-1} J^T Y \beta^\star. \tag{12.10}
$$
Note that when γI = 0, the SVM QP and Eqs. 12.9 and 12.10 give
zero expansion coefficients over the unlabeled data. In this case, the expansion
coefficients over the labeled data and the Q matrix are as in standard SVM. Laplacian
SVMs can be easily implemented using standard SVM software and packages for
solving linear systems.
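The linear algebra surrounding the quadratic program can be sketched as follows. The snippet forms Q from (12.9) and recovers α∗ from (12.10) for a given dual vector; it does not solve the QP (12.8) itself. The vector `beta` is a hand-picked placeholder that merely satisfies the constraints, and the Gaussian kernel, the graph weights, the toy data, and the function name `lapsvm_q_and_alpha` are assumptions made for illustration.

```python
import numpy as np

def lapsvm_q_and_alpha(K, L, y_labeled, beta, gamma_A, gamma_I):
    """Form Q of (12.9) and alpha* of (12.10) for a given dual vector beta."""
    n = K.shape[0]                                       # n = l + u
    l = len(y_labeled)
    J = np.hstack([np.eye(l), np.zeros((l, n - l))])     # l x (l+u) selector
    Y = np.diag(y_labeled)
    M = 2.0 * gamma_A * np.eye(n) + 2.0 * (gamma_I / n ** 2) * (L @ K)
    M_inv_Jt = np.linalg.solve(M, J.T)                   # M^{-1} J^T
    Q = Y @ J @ K @ M_inv_Jt @ Y
    alpha = M_inv_Jt @ Y @ beta
    return Q, alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                             # 4 labeled, 6 unlabeled
y = np.array([1.0, 1.0, -1.0, -1.0])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                                    # assumed Gaussian kernel
W = K - np.diag(np.diag(K))                              # assumed edge weights
L = np.diag(W.sum(axis=1)) - W

# Placeholder dual vector: sum_i y_i beta_i = 0 and 0 <= beta_i <= 1/l.
beta = np.full(4, 0.1)
Q0, alpha0 = lapsvm_q_and_alpha(K, L, y, beta, gamma_A=0.5, gamma_I=0.0)
Q1, alpha1 = lapsvm_q_and_alpha(K, L, y, beta, gamma_A=0.5, gamma_I=1.0)
```

For `gamma_I=0.0` the matrix Q reduces to the labeled block of the Gram matrix scaled by 1/(2γA), and the coefficients on unlabeled points are zero, which is the supervised special case described above.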
In section 12.4, we will discuss a data-dependent kernel defined using unlabeled
examples (Sindhwani et al., 2005), with which standard supervised SVM/RLS
implement Laplacian SVM/RLS. In table 12.1, we outline these algorithms.
The choice of the regularization parameters γA , γI is a subject of future research.
If there are enough labeled data, their choice can be based on cross-validation or
performance on a held-out test set. In figure 12.4 we provide an intuition toward
the role of these parameters on a toy two-moons data set. When γI = 0, Laplacian
SVM recovers standard supervised SVM boundaries. As γI is increased, the effect
of unlabeled data increases and the classification boundaries are appropriately
adjusted.
In figure 12.5 we plot the learning curves for Laplacian SVM/RLS on a two-class
image recognition problem. In many such real-world application settings, one may
expect significant benefit from utilizing unlabeled data and high-quality
out-of-sample extensions with these algorithms. For further empirical results see (Belkin
et al., 2004c; Sindhwani et al., 2005) and elsewhere in this book.
Figure 12.4 Two moons data set: Laplacian SVM with increasing intrinsic regularization.
Panels, left to right: γA = 0.03125, γI = 0; γA = 0.03125, γI = 0.01; γA = 0.03125, γI = 1.
Figure 12.5 Two panels, each plotting Error Rate against Number of Labeled Examples.
data, (ii) extrinsic regularization, and (iii) intrinsic regularization. Since no labeled
data are available, the first term does not arise anymore. Therefore we are left with
the following optimization problem:
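Of course, only the ratio $\gamma = \gamma_A / \gamma_I$ matters. As before, $\|f\|_I^2$ can be approximated using
the unlabeled data. Choosing $\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle$ and approximating it by the
empirical Laplacian, we are left with the following optimization problem:
$$
f^* = \operatorname*{argmin}_{\substack{\sum_i f(x_i) = 0;\ \sum_i f(x_i)^2 = 1 \\ f \in \mathcal{H}_K}} \gamma \|f\|_K^2 + \sum_{i \sim j} (f(x_i) - f(x_j))^2. \tag{12.12}
$$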
Note that without the additional constraints (cf. (Belkin et al., 2004b)) the above
problem gives degenerate solutions.
As in the semi-supervised case, a version of the empirical representer theorem
holds showing that the solution to (12.12) admits a representation of the form
$$
f^* = \sum_{i=1}^{u} \alpha_i K(x_i, \cdot\,).
$$
where 1 is the vector of all ones and α = (α1 , . . . , αu ) and K is the corresponding
Gram matrix.
Letting P be the projection onto the subspace of $\mathbb{R}^u$ orthogonal to K1, one
obtains the solution for the constrained quadratic problem, which is given by the
generalized eigenvalue problem
$$
P(\gamma K + KLK)P v = \lambda P K^2 P v. \tag{12.13}
$$
Figure panels, left to right: γ = 1e−06, γ = 0.0001, γ = 0.1.
The fully supervised case represents the other end of the spectrum of learning. Since
standard supervised algorithms (SVM and RLS) are special cases of manifold regu-
larization, our framework is also able to deal with a labeled data set containing no
unlabeled examples. Additionally, manifold regularization can augment supervised
learning with intrinsic regularization, possibly in a class-dependent manner, which
suggests the following learning problem:
$$
f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I^{+} \mathbf{f}_+^T L_+ \mathbf{f}_+ + \gamma_I^{-} \mathbf{f}_-^T L_- \mathbf{f}_-. \tag{12.14}
$$
Here we introduce two intrinsic regularization parameters $\gamma_I^{+}$, $\gamma_I^{-}$ and regularize
separately for the two classes: $\mathbf{f}_+$, $\mathbf{f}_-$ are the vectors of evaluations of the function
f, and $L_+$, $L_-$ are the graph Laplacians, on positive and negative examples
respectively. The solution to the above problem for RLS and SVM can be obtained
by replacing $\gamma_I L$ by the block-diagonal matrix
$$
\begin{pmatrix} \gamma_I^{+} L_+ & 0 \\ 0 & \gamma_I^{-} L_- \end{pmatrix}
$$
in the Laplacian SVM and Laplacian RLS algorithms.
12.4 Data-Dependent Kernels for Semi-Supervised Learning 229
$$
f^*(x) = \sum_{i=1}^{l} \alpha_i \tilde{K}(x_i, x). \tag{12.16}
$$
With the new kernel K̃ , this representer theorem reduces the minimization
problem (12.5) to that of estimating the l expansion coefficients α∗ . In addition to
recovering the algorithms in section 12.3, this kernel can also be used to implement,
e.g., semi-supervised extensions of support vector regression, one-class SVM, and
Gaussian processes (see (Sindhwani et al., 2006)).
To develop an intuition toward how the intrinsic norm warps the structure of an
RKHS, consider the pictures shown in figure 12.4. A practitioner of kernel methods
would approach the two-circles problem posed in figure 12.1 by choosing a kernel
function K(x, y), and then taking a particular linear combination of this kernel
centered at the two labeled points in order to construct a classifier. Figure 12.7 (a,b)
shows this attempt with the popular Gaussian kernel. The resulting linear decision
surface, shown in figure 12.7 (c), is clearly inadequate for this problem.
In figure 12.8 (a,b) we see level sets for the deformed kernel K̃ centered on the
two labeled points in the two-circles problem.
The kernel deforms along the circle under the influence of the unlabeled data.
Using this kernel, instead of K(x, y), produces a satisfactory class boundary with
just two labeled points, as shown in figure 12.8 (c).
Figure 12.7 panels: (a) Gaussian kernel centered on labeled point 1; (b) Gaussian kernel centered on labeled point 2; (c) classifier learnt in the RKHS.
Figure 12.8 panels: (a) deformed kernel centered on labeled point 1; (b) deformed kernel centered on labeled point 2; (c) classifier learnt in the deformed RKHS.
The procedure described above is a general nonparametric approach for con-
structing data-dependent kernels for semi-supervised learning. This approach differs
from prior constructions that have largely focussed on data-dependent methods for
parameter selection to choose a kernel from some parametric family, or by defining
a kernel matrix on the data points alone (transductive setting).
12.5 Linear Methods for Large-Scale Semi-Supervised Learning 231
$$
w^\star = \operatorname*{argmin}_{w \in \mathbb{R}^d} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, w^T x_i) + \gamma_A \|w\|^2 + \frac{\gamma_I}{(u+l)^2} w^T X^T L X w. \tag{12.17}
$$
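For the squared loss, setting the gradient of (12.17) to zero gives the d × d linear system $(X_l^T X_l + \gamma_A l I + \frac{\gamma_I l}{(u+l)^2} X^T L X)\, w = X_l^T y$, where $X_l$ holds the labeled rows of X. This closed form is derived here for illustration and is not stated in the text; the graph construction, the synthetic data, and the function names in the sketch below are likewise assumptions.

```python
import numpy as np

def linear_laplacian_rls(X, y_labeled, L, gamma_A, gamma_I):
    """Minimizer of (12.17) with squared loss; labeled rows of X come first."""
    n, d = X.shape
    l = len(y_labeled)
    Xl = X[:l]
    A = Xl.T @ Xl + gamma_A * l * np.eye(d) + (gamma_I * l / n ** 2) * (X.T @ L @ X)
    return np.linalg.solve(A, Xl.T @ y_labeled)

def objective(w, X, y_labeled, L, gamma_A, gamma_I):
    """Value of (12.17) with V(x, y, w^T x) = (y - w^T x)^2."""
    n = X.shape[0]
    l = len(y_labeled)
    fit = np.mean((y_labeled - X[:l] @ w) ** 2)
    return fit + gamma_A * (w @ w) + (gamma_I / n ** 2) * (w @ X.T @ L @ X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))             # 6 labeled rows followed by 24 unlabeled
y = np.sign(X[:6, 0])                    # toy labels (assumption)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)                    # assumed dense Gaussian edge weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

w_star = linear_laplacian_rls(X, y, L, gamma_A=0.1, gamma_I=1.0)
```

Only a d × d system is solved, so the cost depends on the input dimension rather than on the number of examples once $X^T L X$ has been formed.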
1. Fast matrix-vector products can also be formed for dense graph regularizers given by
a power series in the (sparse) graph Laplacian
Table 12.2 Objective functions for comparison (in the third column for unsupervised
algorithms, additional constraints are added to avoid trivial or unbalanced solutions). In
addition to these learning problems, the framework also provides the regularized Laplacian
eigenmaps algorithm for dimensionality reduction and data representation.
where y (j) is an indicator variable for class j and w (i) ∈ Rd is the weight vector for
class i. The prior on weight vector w (i) is given by
$$
P(w^{(i)}) \propto \exp\left\{ - \frac{ w^{(i)T} \left( \gamma_I^{(i)} X^T L X + D^{(i)} \right) w^{(i)} }{2} \right\},
$$
(also discussed elsewhere in this book) that implements similar intuitions within
the framework of boosting techniques. In (Altun et al., 2005), a generalization of
the problem in Eq. 12.5 is presented for semi-supervised learning of structured
variables.
By introducing approximations to avoid graph recomputation, methods for out-
of-sample extension have also been suggested without explicitly operating in an
ambiently defined function or model space. In (Delalleau et al., 2005) an induction
formula is derived by assuming that the addition of a test point to the graph does
not change the transductive solution over the unlabeled data. In other words, if
f = [f1 . . . fl+u ft ] denotes a function defined on the augmented graph, with ft as
its value on the node corresponding to the test point, then minimizing the objective
function for graph regularization (with L as the regularizer) keeping the values on
the original nodes fixed, one can obtain a Parzen windows expression for $f_t$:
$$
f_t = \frac{\sum_i W_{ti} f_i}{\sum_i W_{ti}},
$$
where W denotes the adjacency matrix as before. In (Zhu et al., 2003c), a test point
is classified according to its nearest neighbor on the graph, whose classification is
available after transductive inference. In (Chapelle et al., 2003), graph kernels are
constructed by modifying the spectrum of the Gram matrix of a kernel evaluated
over labeled and unlabeled examples. Unseen test points are approximated in the
span of the labeled and unlabeled data, and this approximation is used to extend
the graph kernel.
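The Parzen windows induction formula of (Delalleau et al., 2005) quoted above is simple enough to sketch directly. The snippet assumes Gaussian weights between the test point and the training points and uses hand-set values in place of real transductive outputs; the function name `parzen_induction`, the bandwidth `sigma`, and the data are hypothetical.

```python
import numpy as np

def parzen_induction(x_test, X, f, sigma=1.0):
    """f_t = sum_i W_ti f_i / sum_i W_ti, with W_ti from a Gaussian kernel."""
    w = np.exp(-((X - x_test) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    return float(w @ f / w.sum())

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
f = np.array([1.0, 1.0, -1.0, -1.0])     # stand-in for transductive outputs

f_near_first = parzen_induction(np.array([0.2, 0.5]), X, f)
f_near_second = parzen_induction(np.array([5.0, 5.5]), X, f)
```

Because the prediction is a weighted average of the values on the graph, it always lies between the smallest and largest entries of `f`, and a test point close to one cluster inherits that cluster's sign.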
The regularized Laplacian eigenmap algorithms presented in section 12.3.2 have
also been simultaneously and independently developed by Vert and Yamanishi
(2004) in the context of extending a partially known graph. The graph inference
problem is posed as follows: Suppose a graph G = (V, E) with vertices V and edges
E is observed and is known to be a subgraph of an unknown graph G′ = (V ′ , E ′ )
with V ⊂ V ′ and E ⊂ E ′ . Given the vertices V ′ − V , infer the edges E ′ − E. If
the vertices v are elements of some set V on which a kernel function K : V × V is
defined, then one can infer the graph in two steps: Find a map ψ : V → Rm and
induce a nearest-neighbor graph on the embedded points. To find the map ψ in
the RKHS corresponding to K, one can set up an optimization problem (similar to
that in regularized classification), involving a graph Laplacian-based “data fit” term
that measures how well ψ preserves the local structure of the observed graph and
the RKHS regularizer that provides ambient smoothness. This is also the objective
function of regularized Laplacian eigenmaps, and involves solving the generalized
eigenvalue problem (12.13) for multiple eigenvectors.
additional domain structure, e.g., in the form of invariances and structured outputs.
Many directions are being pursued toward improving the scalability and efficiency
of our algorithms, while developing extensions to handle unlabeled data in, e.g.,
support vector regression, one-class SVMs, and Gaussian processes. We plan to
pursue applications of these methods to a variety of real-world learning tasks, and
investigate issues concerning generalization analysis and model selection.
13 Discrete Regularization
Many real-world machine learning problems are situated on finite discrete sets, in-
cluding dimensionality reduction, clustering, and transductive inference. A variety
of approaches for learning from finite sets has been proposed from different motiva-
tions and for different problems. In most of those approaches, a finite set is modeled
as a graph, in which the edges encode pairwise relationships among the objects in
the set. Consequently many concepts and methods from graph theory are applied,
in particular, graph Laplacians.
In this chapter we present a systemic framework for learning from a finite set rep-
resented as a graph. We develop discrete analogues of a number of differential oper-
ators, and then construct a discrete analogue of classical regularization theory based
on those discrete differential operators. The graph Laplacian-based approaches are
special cases of this general discrete regularization framework. More importantly,
new approaches based on other differential operators are derived as well.
13.1 Introduction
Many real-world machine learning problems can be described as follows: given a set
of objects X = {x1 , x2 , . . . , xl , xl+1 , . . . , xn } from a domain X (e.g., Rd ) in which the
first l objects are labeled as y1 , . . . , yl ∈ Y = {1, −1}, the goal is to predict the labels
of remaining unlabeled objects indexed from l + 1 to n. If the objects to classify
are totally unrelated to each other, we cannot make any prediction statistically
better than random guessing. Typically we may assume that there exist pairwise
relationships among data. For example, given a finite set of vectorial data, the
pairwise relationships among data points may be described by a kernel (Schölkopf
and Smola, 2002). A data set endowed with pairwise relationships can be naturally
modeled as a weighted graph. The vertices of the graph represent the objects, and
the weighted edges encode the pairwise relationships. If the pairwise relationships
are symmetric, the graph is undirected; otherwise, the graph is directed. A typical
example for directed graphs is the World Wide Web (WWW), in which hyperlinks
between webpages may be thought of as directed edges.
Any supervised learning algorithm can be applied to the above inference problem,
e.g., by training a classifier f : X → Y with the set of pairs {(x1 , y1 ), . . . , (xl , yl )},
and then using the trained classifier f to predict the labels of the unlabeled objects.
Following this approach, one will have estimated a classification function defined on
the whole domain X before predicting the labels of the unlabeled objects. According
to (Vapnik, 1998) (see also chapter 24), estimating a classification function defined
on the whole domain X is more complex than the original problem which only
requires predicting the labels of the given unlabeled objects, and a better approach
is to directly predict the labels of the given unlabeled objects. Therefore here
we consider estimating a discrete classification function which is defined on the
given objects X only. Such an estimation problem is called transductive inference
(Vapnik, 1998). In psychology, transductive reasoning means linking particular to
particular with no consideration of the general principles. It is generally used by
young children. In contrast, deductive reasoning, which is used by adults and older
children, means the ability to come to a specific conclusion based on a general
premise (cf. figure 13.1).
It is well known that many meaningful inductive methods such as support
vector machines (SVMs) can be derived from a regularization framework, which
minimizes an empirical loss plus a regularization term. Inspired by this work, we
define discrete analogues of a number of differential operators, and then construct
a discrete analogue of classical regularization theory (Tikhonov and Arsenin, 1977;
Wahba, 1990) using the discrete operators. Much existing work, including spectral
clustering, transductive inference, and dimensionality reduction can be understood
13.2 Discrete Analysis 239
In this section, we first introduce some basic notions on graph theory, and then
propose a family of discrete differential operators, which constitute the basis of the
discrete regularization framework presented in the next section.
13.2.1 Preliminaries
A graph G = (V, E) consists of a finite set V, together with a subset E ⊆ V × V.
The elements of V are the vertices of the graph, and the elements of E are the
edges of the graph. We say that an edge e is incident on vertex v if e starts from
v. A self-loop is an edge which starts and ends at the same vertex. A path is a
sequence of vertices (v1 , v2 , . . . , vm ) such that [vi−1 , vi ] is an edge for all 1 < i ≤ m.
A graph is connected when there is a path between any two vertices. A graph is
undirected when the set of edges is symmetric, i.e., for each edge [u, v] ∈ E we also
have [v, u] ∈ E. In the following, the graphs are always assumed to be connected,
undirected, and have no self-loops or multiple edges; for an example, see figure 13.2.
A graph is weighted when it is associated with a function $w : E \to \mathbb{R}_+$ which
is symmetric, i.e. w([u, v]) = w([v, u]), for all [u, v] ∈ E. The degree function
$d : V \to \mathbb{R}_+$ is defined to be
$$
d(v) := \sum_{u \sim v} w([u, v]),
$$
where u ∼ v denotes the set of the vertices adjacent with v, i.e. [u, v] ∈ E. Let
H(V ) denote the Hilbert space of real-valued functions endowed with the usual
inner product
$$
\langle f, g \rangle_{\mathcal{H}(V)} := \sum_{v \in V} f(v) g(v),
$$
for all f, g ∈ H(V ). Similarly define H(E). In what follows, we will omit the
subscript of inner products if we do not think it is necessary. Note that a function
h ∈ H(E) need not be symmetric. In other words, we do not require h([u, v]) =
h([v, u]).
We define the discrete gradient and divergence operators, which can be thought of
as discrete analogues of their counterparts in the continuous case.
i.e., ∇f is skew-symmetric.
where i denotes the index of a node of the lattice. Unlike the lattice case, the problem
that we have to deal with here is the irregularity of a general graph. Intuitively, in
our definition, before computing the variation of a function between two adjacent
vertices, we break the function value at each vertex among its adjacent edges, and
the value assigned to each edge is proportional to the edge weight. Mathematically,
such a definition can make us finally recover the well-known graph Laplacian in a
way parallel to continuous case (see section 13.2.3).
We may also define the graph gradient at each vertex. Given a function f ∈ H(V )
and a vertex v, the gradient of f at v is defined by ∇f (v) := {(∇f )([v, u]) | [v, u] ∈
E}. We also often denote ∇f (v) by $\nabla_v f$. Then the norm of the graph gradient ∇f
at vertex v is defined by
$$
\|\nabla_v f\| := \left( \sum_{u \sim v} (\nabla f)^2([u, v]) \right)^{1/2},
$$
Intuitively, the norm of the graph gradient measures the roughness of a function
around a vertex, and the p-Dirichlet form the roughness of a function over the
graph. In addition, we define $\|\nabla f([v, u])\| := \|\nabla_v f\|$. Note that $\|\nabla f\|$ has been
defined in the space H(E) as $\|\nabla f\| = \langle \nabla f, \nabla f \rangle_{\mathcal{H}(E)}^{1/2}$.
Definition 13.3 The graph divergence is an operator div : H(E) → H(V ) which
satisfies
$$
\langle \nabla f, h \rangle_{\mathcal{H}(E)} = \langle f, -\operatorname{div} h \rangle_{\mathcal{H}(V)}, \quad \text{for all } f \in \mathcal{H}(V),\ h \in \mathcal{H}(E). \tag{13.3}
$$
In other words, − div is defined to be the adjoint of the graph gradient.
Equation (13.3) can be thought of as a discrete analogue of the Stokes theorem.¹ Note
that the inner products in the left and right sides of (13.3) are respectively in the
spaces H(E) and H(V ).
Proof
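$$
\begin{aligned}
\langle \nabla f, h \rangle
&= \sum_{[u,v] \in E} \nabla f([u,v])\, h([u,v]) \\
&= \sum_{[u,v] \in E} \left( \sqrt{\frac{w([u,v])}{d(v)}}\, f(v) - \sqrt{\frac{w([u,v])}{d(u)}}\, f(u) \right) h([u,v]) \\
&= \sum_{[u,v] \in E} \sqrt{\frac{w([u,v])}{d(v)}}\, f(v) h([u,v]) - \sum_{[u,v] \in E} \sqrt{\frac{w([u,v])}{d(u)}}\, f(u) h([u,v]) \\
&= \sum_{r \in V} \sum_{u \sim r} \sqrt{\frac{w([u,r])}{d(r)}}\, f(r) h([u,r]) - \sum_{r \in V} \sum_{v \sim r} \sqrt{\frac{w([r,v])}{d(r)}}\, f(r) h([r,v]) \\
&= \sum_{r \in V} f(r) \left( \sum_{u \sim r} \sqrt{\frac{w([u,r])}{d(r)}}\, h([u,r]) - \sum_{v \sim r} \sqrt{\frac{w([r,v])}{d(r)}}\, h([r,v]) \right) \\
&= \sum_{r \in V} f(r) \sum_{u \sim r} \sqrt{\frac{w([u,r])}{d(r)}} \bigl( h([u,r]) - h([r,u]) \bigr).
\end{aligned}
$$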
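The adjoint relation (13.3) can be checked numerically on a small graph. In the sketch below, the gradient follows the expression in the second line of the proof, and the divergence is read off from its last line, so that $(\operatorname{div} h)(v) = \sum_{u \sim v} \sqrt{w([u,v])/d(v)}\,\bigl(h([v,u]) - h([u,v])\bigr)$. The 4-vertex weight matrix, the random test functions, and the function names are assumptions chosen for illustration; edge functions are stored as matrices with entry `[u, v]` for edge [u, v].

```python
import numpy as np

def graph_gradient(W, f):
    """(grad f)([u,v]) = sqrt(w/d(v)) f(v) - sqrt(w/d(u)) f(u), stored as G[u, v]."""
    d = W.sum(axis=1)
    S = np.sqrt(W)
    g = f / np.sqrt(d)
    return S * g[None, :] - S * g[:, None]

def graph_divergence(W, H):
    """(div h)(v) = sum_{u~v} sqrt(w([u,v])/d(v)) (h([v,u]) - h([u,v]))."""
    d = W.sum(axis=1)
    S = np.sqrt(W)
    return (S * (H.T - H)).sum(axis=0) / np.sqrt(d)

# Connected, undirected, weighted graph without self-loops (toy example).
W = np.array([[0.0, 1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 3.0],
              [0.0, 0.0, 3.0, 0.0]])

rng = np.random.default_rng(0)
f = rng.normal(size=4)                       # function on vertices
H = rng.normal(size=(4, 4)) * (W > 0)        # function on edges, not symmetric

G = graph_gradient(W, f)
lhs = (G * H).sum()                          # <grad f, h> in H(E)
rhs = f @ (-graph_divergence(W, H))          # <f, -div h> in H(V)
```

The two inner products agree up to rounding, and `G` is skew-symmetric, as stated for ∇f earlier in the section.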
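In this section, we define the graph Laplacian, which can be thought of as a discrete
analogue of the Laplace-Beltrami operator on Riemannian manifolds.
The graph Laplacian is a linear operator because both the gradient and divergence
operators are linear. Furthermore, the graph Laplacian is self-adjoint,
$$
\langle \Delta f, g \rangle = \frac{1}{2} \langle -\operatorname{div}(\nabla f), g \rangle = \frac{1}{2} \langle \nabla f, \nabla g \rangle = \frac{1}{2} \langle f, -\operatorname{div}(\nabla g) \rangle = \langle f, \Delta g \rangle,
$$
and positive semi-definite:
$$
\langle \Delta f, f \rangle = \frac{1}{2} \langle -\operatorname{div}(\nabla f), f \rangle = \frac{1}{2} \langle \nabla f, \nabla f \rangle = S_2(f) \ge 0. \tag{13.7}
$$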
It immediately follows from (13.7) that
Remark 13.7 Equation (13.6) shows that our graph Laplacian defined by (13.5)
is identical to the Laplace matrix in (Chung, 1997) defined to be $\Delta = D^{-1/2}(D - W)D^{-1/2}$,
where D is a diagonal matrix with D(v, v) = d(v), and W a matrix
with W (u, v) = w([u, v]) if [u, v] is an edge and W (u, v) = 0 otherwise. It is worth
mentioning that the matrix L = D − W is often referred to as the combinatorial
(or unnormalized) graph Laplacian, or simply the graph Laplacian. Obviously, this
Laplacian can also be derived in a similar way. Specifically, define a graph gradient
by
$$
(\nabla f)([u, v]) := \sqrt{w([u, v])}\,\bigl(f(v) - f(u)\bigr), \quad \text{for all } [u, v] \in E,
$$
Remark 13.8 For the connection between graph Laplacians (including the
Laplacian we presented here) and the usual Laplacian in the continuous case, we refer the
reader to (von Luxburg et al., 2005; Hein et al., 2005; Bousquet et al., 2004). The
main point is that, if we assume the vertices of a graph are identically and indepen-
dently sampled from some unknown but fixed distribution, when the sampling size
goes to infinity, the combinatorial graph Laplacian does not converge to the usual
Laplacian unless the distribution is uniform.
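The two Laplacians of remark 13.7 can be compared on a small example. The following sketch builds the normalized matrix $\Delta = D^{-1/2}(D - W)D^{-1/2}$ and the combinatorial matrix L = D − W from a toy weight matrix (an assumption, as are the variable names) and inspects their spectra.

```python
import numpy as np

# Connected, undirected, weighted graph without self-loops (toy example).
W = np.array([[0.0, 1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 3.0],
              [0.0, 0.0, 3.0, 0.0]])
d = W.sum(axis=1)                                # degree function d(v)
D = np.diag(d)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

L_comb = D - W                                   # combinatorial Laplacian L = D - W
Delta = D_inv_sqrt @ (D - W) @ D_inv_sqrt        # normalized Laplacian of remark 13.7

eig_comb = np.linalg.eigvalsh(L_comb)            # eigenvalues in ascending order
eig_norm = np.linalg.eigvalsh(Delta)
```

Both matrices are symmetric and positive semi-definite. The constant vector lies in the null space of `L_comb`, whereas the null vector of `Delta` is the square root of the degree vector, which is one concrete way in which the two operators differ on graphs with non-uniform degrees.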
In this section, we define the graph curvature which can be regarded as a discrete
analogue of the mean curvature in continuous case.
Unlike the graph Laplacian (13.5), the graph curvature is a nonlinear operator.
As in theorem 13.6, we have
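Theorem 13.10 $\kappa f = D_f S_1$.
Proof
$$
\begin{aligned}
(D_f S_1)(v)
&= \sum_{u \sim v} \left[ \frac{w([u,v])}{\|\nabla_u f\|} \left( \frac{f(v)}{d(v)} - \frac{f(u)}{\sqrt{d(u)d(v)}} \right) + \frac{w([u,v])}{\|\nabla_v f\|} \left( \frac{f(v)}{d(v)} - \frac{f(u)}{\sqrt{d(u)d(v)}} \right) \right] \\
&= \sum_{u \sim v} w([u,v]) \left( \frac{1}{\|\nabla_u f\|} + \frac{1}{\|\nabla_v f\|} \right) \left( \frac{f(v)}{d(v)} - \frac{f(u)}{\sqrt{d(u)d(v)}} \right) \\
&= \sum_{u \sim v} \frac{w([u,v])}{\sqrt{d(v)}} \left( \frac{1}{\|\nabla_u f\|} + \frac{1}{\|\nabla_v f\|} \right) \left( \frac{f(v)}{\sqrt{d(v)}} - \frac{f(u)}{\sqrt{d(u)}} \right).
\end{aligned}
$$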
Given a graph G = (V, E) and a label set Y = {1, −1}, the vertices v in a subset
S ⊂ V are labeled as y(v) ∈ Y. Our goal is to label the remaining unlabeled vertices,
i.e., the vertices in the complement of S. Assume a classification function f ∈ H(V ),
which assigns a label sign f (v) to each vertex v ∈ V. Obviously, a good classification
function should vary as slowly as possible between closely related vertices while
changing the initial label assignment as little as possible.
Define a function y ∈ H(V ) with y(v) = 1 or −1 if vertex v is labeled as positive or
negative respectively, and 0 if it is unlabeled. Thus we may consider the optimization
problem
where μ ∈]0, ∞[ is a parameter specifying the tradeoff between the two competing
terms. It is not hard to see that the objective function is strictly convex, and hence
by standard arguments in convex analysis the optimization problem has a unique
solution.
∆f ∗ + μ(f ∗ − y) = 0.
The equation in the theorem can be thought of as a discrete analogue of the Euler-
Lagrange equation. It is easy to see that we can obtain a closed-form solution
f∗ = μ(∆ + μI)⁻¹ y, where I denotes the identity operator. Define the function
c : E → R+ by
$$ c([u,v]) = \frac{1}{1+\mu}\,\frac{w([u,v])}{\sqrt{d(u)d(v)}}, \ \text{if } u \neq v; \quad \text{and} \quad c([v,v]) = \frac{\mu}{1+\mu}. \tag{13.13} $$
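As a minimal sketch of the p = 2 case, the following code runs the fixed-point iteration implied by the coefficients in (13.13), namely f(v) ← Σ_{u∼v} c([u, v]) f(u) + c([v, v]) y(v), and then checks the stationarity condition ∆f∗ + μ(f∗ − y) = 0 numerically. The five-vertex graph, the value of μ, and all function names are assumptions made for illustration, and ∆ is assumed to be the normalized Laplacian (∆f)(v) = f(v) − Σ_{u∼v} w([u, v]) f(u)/√(d(u)d(v)), which is the form consistent with (13.13).

```python
import math

# Hypothetical 5-vertex weighted graph (symmetric weights); not from the chapter.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
y = [1.0, 0.0, 0.0, 0.0, -1.0]   # +1 / -1 on labeled vertices, 0 on unlabeled ones
mu = 0.5                          # trade-off parameter (assumed value)

n = len(W)
d = [sum(row) for row in W]       # vertex degrees d(v)


def normalized_weight(u, v):
    """w([u, v]) / sqrt(d(u) d(v))."""
    return W[u][v] / math.sqrt(d[u] * d[v])


def laplacian(f):
    """(Delta f)(v) = f(v) - sum_{u~v} w([u,v]) f(u) / sqrt(d(u) d(v))."""
    return [f[v] - sum(normalized_weight(u, v) * f[u] for u in range(n) if u != v)
            for v in range(n)]


def diffuse(y, mu, iterations=200):
    """Iterate f(v) <- sum_{u~v} c([u,v]) f(u) + c([v,v]) y(v), with c from (13.13)."""
    f = list(y)
    for _ in range(iterations):
        f = [sum(normalized_weight(u, v) / (1.0 + mu) * f[u]
                 for u in range(n) if u != v)
             + mu / (1.0 + mu) * y[v]
             for v in range(n)]
    return f


f_star = diffuse(y, mu)
# Residual of the stationarity condition Delta f* + mu (f* - y) = 0.
residual = [lap + mu * (fv - yv)
            for lap, fv, yv in zip(laplacian(f_star), f_star, y)]
labels = [1 if fv > 0 else -1 for fv in f_star]
print(labels, max(abs(r) for r in residual))
```

Because the normalized weight matrix has spectral radius at most 1, each sweep contracts the error by at least the factor 1/(1 + μ), so a fixed number of sweeps is enough on a graph this small.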
Remark 13.15 It is easy to see that the regularizer for p = 2 can be rewritten as
$$ \frac{1}{2} \sum_{u,v} w([u,v]) \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^2, \tag{13.15} $$
which we earlier suggested for transductive inference (Zhou et al., 2004). A closely
related one is
$$ \frac{1}{2} \sum_{u,v} w([u,v]) \big( f(u) - f(v) \big)^2, \tag{13.16} $$
which appeared in (Joachims, 2003; Belkin et al., 2004a; Zhu et al., 2003b). From
the point of view of spectral clustering, (13.15) can be derived from the normalized
cut (Shi and Malik, 2000), and corresponds to the (normalized) graph Laplacian
(13.5); and (13.16) is derived from the ratio cut (Hagen and Kahng, 1992), and
corresponds to the combinatorial graph Laplacian (see also remark 13.7). In many
real-world experiments, a notable difference between these two regularizers is
that the transductive approaches based on (13.16) (Joachims, 2003; Zhu et al.,
2003b) depend strongly on prior knowledge of the class proportions, while the
approach based on (13.15) (Zhou et al., 2004) can work well without such
prior knowledge. For more details, we refer the reader to chapter 21 (Analysis of
Benchmarks) and chapter 11 (Label Propagation and Quadratic Criterion).
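The correspondence between the two regularizers and the two Laplacians can be checked numerically. The sketch below evaluates (13.15) and (13.16) as double sums over ordered vertex pairs and compares them with the quadratic forms of the normalized matrix I − D^{-1/2} W D^{-1/2} and the combinatorial matrix D − W. The weight matrix, the test function f, and the helper names are hypothetical.

```python
import math

# Hypothetical symmetric weight matrix and test function; not taken from the chapter.
W = [
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
]
f = [0.5, -1.0, 2.0, 0.25]
n = len(W)
d = [sum(row) for row in W]


def normalized_regularizer(f):
    """(13.15): 1/2 sum_{u,v} w (f(u)/sqrt(d(u)) - f(v)/sqrt(d(v)))^2."""
    total = 0.0
    for u in range(n):
        for v in range(n):
            diff = f[u] / math.sqrt(d[u]) - f[v] / math.sqrt(d[v])
            total += W[u][v] * diff * diff
    return 0.5 * total


def unnormalized_regularizer(f):
    """(13.16): 1/2 sum_{u,v} w (f(u) - f(v))^2."""
    return 0.5 * sum(W[u][v] * (f[u] - f[v]) ** 2
                     for u in range(n) for v in range(n))


def quadratic_form(A, f):
    """f^T A f for a dense matrix stored as a list of rows."""
    return sum(f[i] * A[i][j] * f[j] for i in range(n) for j in range(n))


normalized_laplacian = [
    [(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
    for i in range(n)
]
combinatorial_laplacian = [
    [(d[i] if i == j else 0.0) - W[i][j] for j in range(n)]
    for i in range(n)
]

print(normalized_regularizer(f), quadratic_form(normalized_laplacian, f))
print(unnormalized_regularizer(f), quadratic_form(combinatorial_laplacian, f))
```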
Remark 13.16 One can construct many other similar regularizers. For instance,
one might consider (Roweis and Saul, 2000)
$$ \frac{1}{2} \sum_{v} \Big( f(v) - \sum_{u \sim v} p([u,v])\, f(u) \Big)^2, \tag{13.17} $$
where the function p : E → R+ is defined to be p([u, v]) = w([u, v])/d(u). Note that
p is not symmetric. This regularizer measures the difference between the value of f
at vertex v and the average of f over the neighbors of v.
κf ∗ + 2μ(f ∗ − y) = 0.
to obtain the solution, in which the coefficients c(t) are updated according to (13.20)
and (13.19). It can be shown that this iterative result is independent of the setting of
the initial value. Compared with the iterative algorithm (13.14) in the case of p = 2,
the coefficients in the present method are adaptively updated at each iteration, in
addition to the function being updated.
p∆p f ∗ + 2μ(f ∗ − y) = 0.
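We can construct a similar iterative algorithm to obtain the solution. Specifically,
$$ f^{(t+1)}(v) = \sum_{u \sim v} c^{(t)}([u,v])\, f^{(t)}(u) + c^{(t)}([v,v])\, y(v), \quad \text{for all } v \in V, \tag{13.24} $$
where
$$ c^{(t)}([u,v]) = \frac{\dfrac{m^{(t)}([u,v])}{\sqrt{d(u)d(v)}}}{\displaystyle\sum_{u \sim v} \frac{m^{(t)}([u,v])}{d(v)} + \frac{2\mu}{p}}, \quad \text{if } u \neq v, \tag{13.25} $$
and
$$ c^{(t)}([v,v]) = \frac{\dfrac{2\mu}{p}}{\displaystyle\sum_{u \sim v} \frac{m^{(t)}([u,v])}{d(v)} + \frac{2\mu}{p}}, \tag{13.26} $$
and
$$ m^{(t)}([u,v]) = \frac{w([u,v])}{p} \left( \|\nabla_u f^{(t)}\|^{p-2} + \|\nabla_v f^{(t)}\|^{p-2} \right). \tag{13.27} $$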
It is easy to see that the iterative algorithms in sections 13.3.1 and 13.3.2 are
special cases of this general one, with p = 2 and p = 1 respectively. Moreover, it is
interesting to note that p = 2 is a critical point.
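The general update (13.24)-(13.27) can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: the local variation is assumed to be ‖∇_v f‖ = (Σ_{u∼v} w([u, v]) (f(v)/√d(v) − f(u)/√d(u))²)^{1/2}, a small constant `eps` is added to it to avoid division by zero when p < 2, and the graph, μ, and function names are hypothetical. For p = 2 the coefficients m reduce to the weights w, so the iteration coincides with the p = 2 diffusion.

```python
import math

# Hypothetical 5-vertex weighted graph (symmetric weights); not from the chapter.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
y = [1.0, 0.0, 0.0, 0.0, -1.0]
mu = 0.5
n = len(W)
d = [sum(row) for row in W]


def local_variation(f, v, eps=1e-12):
    """Assumed definition of ||grad_v f||; eps guards the negative powers used when p < 2."""
    total = sum(W[u][v] * (f[v] / math.sqrt(d[v]) - f[u] / math.sqrt(d[u])) ** 2
                for u in range(n) if u != v)
    return math.sqrt(total) + eps


def diffusion_step(f, p):
    """One sweep of (13.24), with coefficients recomputed from the current f."""
    norms = [local_variation(f, v) for v in range(n)]
    new_f = []
    for v in range(n):
        # m([u, v]) from (13.27); zero where there is no edge.
        m = [W[u][v] / p * (norms[u] ** (p - 2) + norms[v] ** (p - 2)) if u != v else 0.0
             for u in range(n)]
        denom = sum(m[u] / d[v] for u in range(n)) + 2.0 * mu / p
        neighbors = sum(m[u] / math.sqrt(d[u] * d[v]) * f[u] for u in range(n))
        new_f.append((neighbors + 2.0 * mu / p * y[v]) / denom)
    return new_f


def general_diffusion(p, iterations=200):
    f = list(y)
    for _ in range(iterations):
        f = diffusion_step(f, p)
    return f


f_p2 = general_diffusion(p=2)   # linear diffusion
f_p1 = general_diffusion(p=1)   # curvature-based, coefficients adapt at each sweep
print(f_p2)
print(f_p1)
```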
13.4 Conclusion
Recently graph-based algorithms, in which nodes represent data points and links
encode similarities, have become popular for semi-supervised learning. In this
chapter we introduce a general probabilistic formulation called conditional harmonic
mixing (CHM), in which the links are directed, a conditional probability matrix is
associated with each link, and where the numbers of classes can vary from node
to node. The posterior class probability at each node is updated by minimizing
the Kullback-Leibler (KL) divergence between its distribution and that predicted
by its neighbors. We show that for arbitrary graphs, as long as each unlabeled
point is reachable from at least one training point, a solution always exists, is
unique, and can be found by solving a sparse linear system iteratively. This result
holds even if the graph contains loops, or if the conditional probability matrices
are not consistent. We show how, given a classifier for a task, CHM can learn its
transition probabilities. Using the Reuters database, we show that CHM improves
the accuracy of the best available classifier, for small training set sizes.
14.1 Introduction
CHM is a highly redundant model, in that for a “perfect” CHM model of a given
problem, the posterior for a given node can be computed from the posterior at any
adjacent node, together with the conditional probability matrix on the arc joining
them. However this is an idealization: CHM handles this by asking that the posterior
at a given node be that distribution such that the number of bits needed to describe
the distributions predicted at that node, by the adjacent nodes, is minimized. This
is accomplished by minimizing a Kullback-Leibler (KL) divergence (see below).
Building on an idea proposed in Zhu et al. (2003b), CHM can also be used to
improve the accuracy of another, given base classifier.
1. Recently, Zhou et al. (2005b) have extended Laplacian SSL to the case of directed arcs.
A second reason for using directed arcs is that the relations between points can
themselves be asymmetric (even if both are unlabeled). For example, in k-nearest
neighbor, if point A is the nearest neighbor of point B, point B need not be that of
point A. Such asymmetric relations can be captured with directed arcs.
CHM shares with Laplacian SSL the desirable property that its solutions are
harmonic, and unique, and can be computed iteratively and efficiently. It shares with
Bayesian graphical models the desirable property that it is a probabilistic model
from the ground up. We end this section with a simple but useful observation, but
first we must introduce some notation. Suppose nodes i and j (i, j ∈ {1, . . . , N}) are
connected by a directed arc from i to j (throughout, we will index nodes by i, j, k
and vector indices by a, b). We will represent the posterior at any node k as the
vector P(Xk = Ma) ≡ Pk (so that Pk is a vector indexed by a), and the conditional
on the arc as P(Xj | Xi, G) ≡ Pji (so that Pji is a matrix indexed by the class index
at node j and the class index at node i). Then the computation of i’s prediction of
the posterior at j is just the matrix vector multiply $P_{ji} P_i \doteq \sum_b (P_{ji})_{ab} (P_i)_b$. Note
that all conditional matrices are also conditioned on the training data, the graph
structure, and other factors, which we denote collectively by G. We emphasize that
the number of classes at different nodes can differ, in which case the conditional
matrices joining them will be rectangular. Note also that the Pji are column
stochastic matrices. Similarly we will call any vector whose components are a
non-negative partition of unity a stochastic vector. Then we have the following
observation.
Proposition 14.1 Given any two stochastic vectors Pi and Pj , there always exists
a conditional probability matrix Pij such that Pi = Pij Pj .
This follows trivially from the choice (Pij )ab = (Pi )a ∀b, and just corresponds to
the special case that the probability vectors Pi and Pj are independent. This shows
that CHM is able, in principle, to model any set of posteriors on the nodes, and
that some form of regularization will therefore be needed if we expect, for example,
to learn nontrivial matrices Pij given a set of posteriors Pi . We will impose this
regularization by partitioning the Na arcs in the graph into a small number n of
equivalence classes, where n ≪ Na , such that arcs in a given equivalence class are
to have the same Pij . In this chapter, we will use nearest-neighbor relations to
determine the equivalence classes.
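The construction behind proposition 14.1 is short enough to write out. In the sketch below, the two stochastic vectors and the function names are made up for illustration; the matrix is built by copying Pi into every column, so it is column stochastic, may be rectangular, and maps any stochastic Pj to Pi.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def independent_conditional(p_i, n_classes_j):
    """(P_ij)_{ab} = (P_i)_a for every b: each column is a copy of P_i."""
    return [[p_a for _ in range(n_classes_j)] for p_a in p_i]


# Hypothetical posteriors: three classes at node i, two classes at node j.
p_i = [0.7, 0.2, 0.1]
p_j = [0.4, 0.6]

P_ij = independent_conditional(p_i, len(p_j))   # 3 x 2, rectangular
prediction = matvec(P_ij, p_j)                  # node j's prediction of P_i
column_sums = [sum(P_ij[a][b] for a in range(len(p_i))) for b in range(len(p_j))]
print(prediction, column_sums)
```

Since this trivial choice fits any set of posteriors, it illustrates why the arcs must be tied into a small number of equivalence classes before the matrices can carry information.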
Zhu and Ghahramani (2002), Zhu et al. (2003b), and Zhou et al. (2004) introduce
Laplacian SSL for transductive learning using graphs (see also chapter 11). Each
link is given a scalar weight that measures the similarity between the data points
attached to the nodes at that link’s endpoints, and each node has a scalar value. The
objective function is a weighted sum of squared differences in function values
between pairs of nodes, with positive weights; minimizing this encourages the modeled
function to vary slowly across nodes. The solution is a harmonic function (Doyle
and Snell, 1984) in which each function value is the weighted sum of neighboring
values. The function is thresholded to make the classification decision. In contrast
to Laplacian SSL, at the solution, for nondiagonal conditional probability matri-
ces, the CHM conditional harmonic property generates extra additive terms in the
function on the nodes, which are not present in the Gaussian field solution (which
is homogeneous in the function values). The two methods coincide only when the
random variables at all nodes correspond to just two classes, when the conditional
probability matrices in CHM are 2 × 2 unit matrices, where all the weights in the
Gaussian random field are unity, and where all nodes are joined by directed arcs
in both directions. Finally, one concrete practical difference is that CHM can han-
dle one-sided classification problems, where training data from only one class are
available, by using conditional posterior matrices other than unit matrices. In this
case, the objective function in (Zhu et al., 2003b) is minimized by attaching the
same label to all the unlabeled data. However, either method can handle one-sided
classification problems by leveraging results from an existing one-sided classifier;
we will explore this method below.
Zhu et al. (2003c) embed the label propagation work in a probabilistic framework
by showing that the model can be viewed in terms of Gaussian processes. However,
to establish the connection, extra assumptions and approximations are required: the
The structure of the CHM graph will depend on the problem at hand; however,
all graphs share the weak constraint that for every² test node i, there must exist
a path in the graph joining i with a training node. We will refer to such nodes as
label-connected, and to the graph as a whole as label-connected if every test node
in the graph is label-connected. A neighbor of a given node i is defined to be any
node j which is adjacent to node i, where “adjacent” means that there exists an arc
from j to i.
We use the following notation: we assume that the random variable at node i has
Mi states (or classes), and that the arc from node i to node j carries an Mj × Mi
conditional probability matrix Pji. We adopt the convention that Pji is the Mj × Mi
matrix of all zeros if there is no arc from node i to node j. We denote the posterior
at node i by the vector $P_i \in \mathbb{R}^{M_i}$ for unlabeled nodes, and by $Q_i \in \mathbb{R}^{M_i}$ for labeled
nodes. Denote the set of labeled nodes by L, with l ≐ |L|, and the set of unlabeled
nodes by U, with u ≐ |U|, let M(i) (N(i)) denote the set of indices of labeled
(unlabeled) nodes adjacent to node i, and define I ≐ M ∪ N with n(i) = |I(i)|.
Finally, for node i, let p(i) be the number of incoming arcs from adjacent test nodes,
and let q(i) be the number of incoming arcs from adjacent train nodes.
2. For readability we use the indices i, j to denote the nodes themselves, since these
quantities appear frequently as subscripts. We use the terms “test node” and “unlabeled
node” interchangeably.
A given node in the graph receives an estimate of its posterior from each of
its neighbors. These estimates may not agree. Suppose that the hypothesized
distribution at node i is Qi, and let the estimates from its n(i) neighbors be Pj, j ∈
I(i), so that Pj = Pjk Pk for each k ∈ I(i). Given Qi, the number of bits required to
describe the distributions Pj is $\sum_j \{H(P_j) + D(P_j | Q_i)\}$, where H is the entropy and
D the KL divergence. Since we wish to use Qi to describe the combined distributions
Pj as closely as possible, we require that this number of bits be minimized. For fixed
Pj, this is accomplished by setting $(Q_i)_a = (1/n(i)) \sum_{j=1}^{n(i)} (P_j)_a$. A function on a
graph is called harmonic (Doyle and Snell, 1984; Zhu et al., 2003b) if at each internal
node the value of the function is the (possibly weighted) mean of the values at its
neighboring points (an internal node, as opposed to a boundary node, is one whose
function value is not fixed; below we will just use the terms “unlabeled node” and
“labeled node” for internal and boundary nodes). Assuming that a solution exists,
then at the solution, the posterior at a given node is the weighted mean of the
posteriors of its neighbors, where the weights are conditional probability matrices;
hence the name “conditional harmonic mixing.”
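The claim that the arithmetic mean minimizes the description length can be checked numerically. The sketch below uses three made-up neighbor estimates, computes Σ_j {H(Pj) + D(Pj | Q)} in bits, and compares the value at the mean with the value at randomly drawn stochastic vectors; all names and numbers are assumptions chosen for the example.

```python
import math
import random


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """D(p | q) in bits; q is assumed strictly positive wherever p is."""
    return sum(x * math.log2(x / qx) for x, qx in zip(p, q) if x > 0)


def description_length(estimates, q):
    """Number of bits needed to describe all estimates P_j given the hypothesis q."""
    return sum(entropy(p) + kl_divergence(p, q) for p in estimates)


def mean_distribution(estimates):
    """(Q)_a = (1/n) sum_j (P_j)_a."""
    count = len(estimates)
    return [sum(components) / count for components in zip(*estimates)]


# Hypothetical estimates received from three neighbors.
estimates = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.9, 0.05, 0.05]]
q_mean = mean_distribution(estimates)
best = description_length(estimates, q_mean)

rng = random.Random(0)


def random_stochastic(k):
    raw = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


# Smallest excess cost over 1000 random alternatives; it should never be negative.
smallest_gap = min(description_length(estimates, random_stochastic(3)) - best
                   for _ in range(1000))
print(q_mean, best, smallest_gap)
```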
where P3 = (1, 0, 0, . . . ) and where the ones in the matrices represent unit matrices.
Figure 14.2 A simple three-point CHM graph (nodes 1, 2, and 3; the arcs carry the
conditional probability matrices P12, P21, P13, and P23).
We wish to prove four properties of these equations, for any choice of conditional
probability matrices P23, P21, P12, and P13: first, that a solution always exists;
second, that it is unique; third, that it results in stochastic vectors for the solution
P1 and P2; and fourth, that Jacobi iterates will converge to it (by solving with Jacobi
iterates, we will be able to take advantage of the sparseness of larger problems, as
we will see below). Rearranging, we have
$$ \begin{pmatrix} 1 & -\tfrac{1}{2} P_{12} \\ -\tfrac{1}{2} P_{21} & 1 \end{pmatrix} \begin{pmatrix} P_1 \\ P_2 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} P_{13} P_3 \\ P_{23} P_3 \end{pmatrix}. \tag{14.2} $$
The equations will always take this general form, where the matrix on the left is
square (but not necessarily symmetric) and of side Cu, and where the left-hand
side depends only on the unlabeled points (whose posteriors we wish to find) and
the right, only on the labeled points. Define
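$$ b \doteq \frac{1}{2} \begin{pmatrix} P_{13} P_3 \\ P_{23} P_3 \end{pmatrix}, \quad M \doteq \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad N \doteq \begin{pmatrix} 0 & \tfrac{1}{2} P_{12} \\ \tfrac{1}{2} P_{21} & 0 \end{pmatrix} \tag{14.3} $$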
and consider the following iterative algorithm for finding the solution, where $x^{(0)}$
is arbitrary:
$$ M x^{(k+1)} = N x^{(k)} + b. \tag{14.4} $$
With the above definitions, this is a Jacobi iteration (Golub and Van Loan, 1996,
p. 510), and we have:
Theorem 14.2 (Golub and Van Loan (1996, Theorem 10.1.1)) Suppose
$b \in \mathbb{R}^d$ and $\Delta \doteq M - N \in \mathbb{R}^{d \times d}$ is nonsingular. If M is nonsingular and the
spectral radius of $M^{-1}N$ satisfies the inequality $\rho(M^{-1}N) < 1$, then the iterates
$x^{(k)}$ defined by (14.4) converge to $x = \Delta^{-1} b$ for any starting vector $x^{(0)}$.
Since N here is one-half times a column-stochastic matrix, its eigenvalues have
absolute value at most 1/2, so $\rho(M^{-1}N) < 1$. Hence for this graph, a solution
always exists and is unique. If we start with stochastic vectors everywhere (chosen
arbitrarily on the unlabeled nodes), then they will remain stochastic since each
Jacobi iterate maintains this property, and the solution will be stochastic.³ Note
also that the matrix M − N is diagonally dominant, and so has an inverse. However,
for the general case, N may not be proportional to a column stochastic matrix, and
furthermore M − N may not be diagonally dominant; we will need a more general
argument.
3. In fact this is true even if the initial vectors on the unlabeled nodes are chosen
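For the three-point graph, the Jacobi iteration with M = I reduces to averaging the two incoming predictions at each unlabeled node. The sketch below assumes two classes per node and made-up column-stochastic matrices; it runs the iteration from two different stochastic starting points and compares the results.

```python
def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


# Hypothetical column-stochastic conditional matrices (each column sums to 1).
P12 = [[0.9, 0.2], [0.1, 0.8]]
P21 = [[0.8, 0.3], [0.2, 0.7]]
P13 = [[0.95, 0.1], [0.05, 0.9]]
P23 = [[0.6, 0.4], [0.4, 0.6]]
P3 = [1.0, 0.0]   # labeled node: class 1 with certainty


def jacobi(P1, P2, iterations=100):
    """x^(k+1) = N x^(k) + b with M = I; both updates read the previous iterate."""
    for _ in range(iterations):
        new_P1 = [0.5 * (a + b) for a, b in zip(matvec(P12, P2), matvec(P13, P3))]
        new_P2 = [0.5 * (a + b) for a, b in zip(matvec(P21, P1), matvec(P23, P3))]
        P1, P2 = new_P1, new_P2
    return P1, P2


solution_a = jacobi([1.0, 0.0], [1.0, 0.0])
solution_b = jacobi([0.0, 1.0], [0.5, 0.5])
print(solution_a)
print(solution_b)
```

Since the eigenvalues of N have absolute value at most 1/2 here, the error at least halves at every iteration, which is why a fixed iteration count suffices in this toy setting.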
At the CHM solution, for each node i, we have the consistency conditions
$$ P_i - \frac{1}{p(i)+q(i)} \Big( \sum_{j \in N(i)} P_{ij} P_j \Big) = \frac{1}{p(i)+q(i)} \Big( \sum_{j \in M(i)} P_{ij} Q_j \Big), \tag{14.5} $$
where the right-hand side is defined to be zero if M(i) = ∅. Let $p = \sum_{i \in U} M_i$, and
define a block matrix $A \in \mathbb{R}^{p \times p}$ with ones along the diagonal and whose off-diagonal
elements, which are either zero matrices or are the matrices $-\frac{1}{p(i)+q(i)} P_{ij}$, are chosen
so that (14.5) can be written as
$$ AP = Q, \tag{14.6} $$
Referring to theorem 14.2, we see that in this case, A = M − N where $M \doteq I$ and
$N_{ij} \doteq \frac{1}{p(i)+q(i)} P_{ij}$ (recall that we define Pij to be the matrix of all zeros, if there is
no arc from node j to node i), and b is the second term on the right-hand side of
(14.7). Then the kth Jacobi iterate takes the form $M P^{(k)} = N P^{(k-1)} + b$. We can
now state:
Theorem 14.3 Consider a label-connected CHM graph with l labeled nodes. Assume
that the vectors at the labeled nodes are fixed and stochastic. Then a solution
to the corresponding CHM equations, (14.6), always exists and is unique. Furthermore,
at the solution, the vector $P_i^* \in \mathbb{R}^{M_i}$ at the ith unlabeled node is stochastic for
all i, and the Jacobi iterates on the graph will always converge to the same solution,
regardless of the initial values given to the Pi.
Proof
1. Assume: the CHM graph is label-connected.
2. ρ(N ) < 1.
2.1. Proof Consider the eigenvalue equation:
N µ = λµ. (14.8)
Just as we view N as a block matrix whose ith, jth element is the matrix
$\frac{1}{p(i)+q(i)} P_{ij}$, similarly view µ as a vector whose ith element is a vector of
dimension Mi. Then let µi be that component of µ whose L1 norm is largest (or
any such component if there are more than one) and consider the corresponding
rows of (14.8), which encapsulate the Mi equations
$$ \frac{1}{p(i)+q(i)} \Big( \sum_{j \in N(i)} P_{ij}\, \mu_j \Big) = \lambda \mu_i. \tag{14.9} $$
where $\|\cdot\|_1$ denotes the L1 norm, and where the second line follows
from an inequality satisfied by all p norms (Golub and Van Loan,
1996). Since by assumption q(i) ≥ 1, taking the L1 norm of both
sides of (14.9) gives |λ| < 1.
where the last step follows from the assumption that µi has largest
L1 norm. Thus for each j ∈ N (i), we can repeat the above argument
with µj on the right-hand side of (14.9), and the argument can then
be recursively repeated for each k ∈ N (j), until (14.9) has been
constructed for every node for which a path exists to node i. However
since the graph is label-connected, that set of nodes will include a
test node which is adjacent to a train node. The previous argument,
which assumed that q > 0, then shows that |λ| < 1. Thus, in general
for any label-connected CHM graph, |λ| < 1, and so ρ(N ) < 1.
3. A is nonsingular.
3.1 Proof Since ρ(N) < 1, the eigenvalues of N all lie strictly within the unit
circle centered on the origin in the complex plane C. Since N = 1 − A (cf.
(14.6)), if e is an eigenvector of A with eigenvalue λ, then it is an eigenvector
of N with eigenvalue 1 − λ, and so since 1 − λ lies strictly within the unit circle
centered on the origin in C, λ itself lies strictly within the unit circle centered
on the point {1, 0} ∈ C, so λ ≠ 0. Hence none of A’s eigenvalues vanish, and A
is nonsingular.
4. A solution to the CHM equations exists and is unique.
4.1 Proof Since A is nonsingular, AP = Q has unique solution P = A−1 Q.
5. At the solution, the random vector $P_i \in \mathbb{R}^{M_i}$ at each unlabeled node is stochastic,
regardless of its initial value.
5.1 Proof For all unlabeled nodes, choose $P_i^{(0)}$ to be that stochastic vector
whose first component is 1 and whose remaining components vanish. Then from
(14.7), by construction $P_i^{(k)}$ is stochastic for all k. Hence from theorem 14.2 and
steps 2 and 3 above, the Jacobi iterates will converge to a unique solution, and
at that solution the Pi will be stochastic for all i ∈ N. Finally, by theorem 14.2,
the same (unique) solution will be found regardless of the initial values of the
Pi.
beyond the requirement that they be column stochastic. In particular, it does not
assume that the conditional probability matrices on the graph are consistent, that is,
that there exists a joint probability from which all conditionals (or even any subset
of them) could be derived by performing appropriate marginalizations. The CHM
algorithm can therefore be applied using measured estimates of the conditional
probability matrices, for which no precise joint exists.
2. In general A is not symmetric (and need not be row- or column-diagonally
dominant).
3. No structure is imposed on the graph beyond its being label-connected. In
particular, the graph can contain loops.
4. The numbers of classes at each node can differ, in which case the conditional
probability matrices will be rectangular.
5. The model handles probabilistic class labels, that is, the Qi can be arbitrary
stochastic vectors.
6. To improve convergence, Gauss-Seidel iterations should be used, instead of
Jacobi iterations. For Gauss-Seidel iterations, the error tends to zero like $\rho(M^{-1}N)^k$
(Golub and Van Loan, 1996, p. 514).
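The difference between the two sweeps can be sketched on a small chain graph. In the code below, the graph, the conditional matrix T, the tolerance, and the function names are assumptions made for illustration. Each update averages the predictions Pij Pj over the incoming arcs of node i; the Jacobi variant reads a frozen copy of the previous sweep, while the Gauss-Seidel variant reads values already updated in the current sweep.

```python
def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def chm_update(i, posteriors, incoming):
    """Average the predictions P_ij P_j over all nodes j with an arc into i."""
    predictions = [matvec(P_ij, posteriors[j]) for j, P_ij in incoming[i]]
    return [sum(components) / len(predictions) for components in zip(*predictions)]


def relax(incoming, labeled, initial, gauss_seidel, tol=1e-10, max_sweeps=10000):
    """Return (posteriors, number of sweeps) once no entry changes by more than tol."""
    current = {**labeled, **initial}
    for sweep in range(1, max_sweeps + 1):
        # Jacobi reads a frozen copy of the previous sweep; Gauss-Seidel reads
        # the dictionary that is being updated in place.
        source = current if gauss_seidel else dict(current)
        change = 0.0
        for i in initial:
            updated = chm_update(i, source, incoming)
            change = max(change, max(abs(a - b) for a, b in zip(updated, current[i])))
            current[i] = updated
        if change < tol:
            return current, sweep
    return current, max_sweeps


# Hypothetical chain: labeled node 0 -> 1 <-> 2 <-> 3 <-> 4, two classes per node.
T = [[0.9, 0.1], [0.1, 0.9]]          # column-stochastic conditional on every arc
labeled = {0: [1.0, 0.0]}
initial = {i: [0.5, 0.5] for i in (1, 2, 3, 4)}
incoming = {
    1: [(0, T), (2, T)],
    2: [(1, T), (3, T)],
    3: [(2, T), (4, T)],
    4: [(3, T)],
}

jacobi_solution, jacobi_sweeps = relax(incoming, labeled, initial, gauss_seidel=False)
gs_solution, gs_sweeps = relax(incoming, labeled, initial, gauss_seidel=True)
print(jacobi_sweeps, gs_sweeps)
print(gs_solution[4])
```

Both variants converge to the same stochastic vectors because the graph is label-connected; the sweep counts printed at the end give a rough sense of the difference in convergence speed on this particular example.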
Suppose that we are given the outputs of a given classifier on the data set. The
classifier was trained on the available labeled examples, but the amount of training
data is limited and we wish to use SSL to improve the results. We can adopt an
idea proposed in (Zhu et al., 2003b), and for each node in the graph, attach an
additional, labeled node, whose label is the posterior predicted for that data point.
In fact CHM allows us to combine several classifiers in this way. This mechanism has
the additional advantage of regularizing the CHM smoothing: the model can apply
more, or less, weight to the original classifier outputs, by adjusting the conditionals
on the arcs. Furthermore, for graphs that fall into several components, some of
which are not label-connected, this method results in sensible predictions for the
disconnected subgraphs; the CHM relaxation can be performed even for subgraphs
containing no labeled data, since the base classifier still makes predictions for those
nodes. In the context of CHM, for brevity we call this procedure of leveraging a base
classifier over a graph “lifting”. We will explore this approach empirically below.
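A rough sketch of lifting follows, under the assumption that each attached labeled node is joined to its data point by a single arc carrying a unit matrix unless another conditional is supplied. The two-node graph, the base posteriors, and the helper names `lift` and `relax` are hypothetical. The example graph contains no labeled data at all, yet after lifting the relaxation is well defined because every node now receives an arc from its own base-classifier node.

```python
def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lift(incoming, base_posteriors, conditional=None):
    """Attach one extra labeled node per data point, holding the base classifier's posterior."""
    lifted = {i: list(arcs) for i, arcs in incoming.items()}
    labeled = {}
    for i, posterior in base_posteriors.items():
        k = len(posterior)
        unit = [[1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
        label_node = ("base", i)
        labeled[label_node] = list(posterior)
        lifted.setdefault(i, []).append((label_node, conditional or unit))
    return lifted, labeled


def relax(incoming, labeled, unlabeled_init, sweeps=200):
    """Gauss-Seidel style CHM relaxation: each node averages its incoming predictions."""
    posteriors = {**labeled, **{i: list(p) for i, p in unlabeled_init.items()}}
    for _ in range(sweeps):
        for i in unlabeled_init:
            predictions = [matvec(P_ij, posteriors[j]) for j, P_ij in incoming[i]]
            posteriors[i] = [sum(c) / len(predictions) for c in zip(*predictions)]
    return {i: posteriors[i] for i in unlabeled_init}


# Hypothetical subgraph with no labeled data: nodes "a" and "b" point at each other.
unit2 = [[1.0, 0.0], [0.0, 1.0]]
incoming = {"a": [("b", unit2)], "b": [("a", unit2)]}
base_posteriors = {"a": [0.9, 0.1], "b": [0.4, 0.6]}   # outputs of some base classifier

lifted_incoming, labeled = lift(incoming, base_posteriors)
smoothed = relax(lifted_incoming, labeled, base_posteriors)
print(smoothed)
```

Passing a different `conditional` matrix changes how much weight the smoothing gives to the base classifier relative to the neighbors.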
We are still faced with the problem of finding the conditional matrices Pij . Here we
propose one method for solving this, which we explore empirically below. Consider
again the simple CHM model shown in figure 14.2, and to simplify the exposition,
assume that the number of classes at each node is two, and in addition require
that Pl ≐ P13 = P23 and that Pu ≐ P12 = P21 (l, u denoting labeled, unlabeled
respectively). We can parameterize the matrices as
$$ P_l = \begin{pmatrix} 1 - v_1 & v_2 \\ v_1 & 1 - v_2 \end{pmatrix}, \quad P_u = \begin{pmatrix} 1 - v_3 & v_4 \\ v_3 & 1 - v_4 \end{pmatrix}, \tag{14.10} $$
where 0 ≤ vi ≤ 1 ∀i. Now suppose that the posteriors on every node in figure 14.2
are given, and denote components by, e.g., [P1a, P1b]. In that case, (14.1) may be
rewritten as
$$ \begin{pmatrix} -P_{3a} & P_{3b} & -P_{2a} & P_{2b} \\ P_{3a} & -P_{3b} & P_{2a} & -P_{2b} \\ -P_{3a} & P_{3b} & -P_{1a} & P_{1b} \\ P_{3a} & -P_{3b} & P_{1a} & -P_{1b} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{pmatrix} = \begin{pmatrix} 2P_{1a} - P_{2a} - P_{3a} \\ 2P_{1b} - P_{2b} - P_{3b} \\ 2P_{2a} - P_{1a} - P_{3a} \\ 2P_{2b} - P_{1b} - P_{3b} \end{pmatrix}, \tag{14.11} $$
The posteriors Pi can simply be the outputs of a given classifier on the problem, if
the classifier outputs are well-calibrated probabilities, or thresholded vectors (whose
elements are 0 or 1) for arbitrary classifiers. To summarize: given some estimate of
the posteriors on every node, the conditional probability matrices on the arcs can
be learned by solving a quadratic programming problem.
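One way to make this concrete is to treat (14.11) as a least-squares problem in v subject to 0 ≤ vi ≤ 1. The sketch below is an illustration only: it swaps in plain projected gradient descent for a quadratic programming solver, and the posteriors, iteration count, and function names are assumptions. The step size 1/‖A‖²_F is no larger than the reciprocal of the gradient's Lipschitz constant, so the objective cannot increase from one iteration to the next.

```python
def build_system(P1, P2, P3):
    """Coefficient matrix and right-hand side of (14.11) for two classes a, b."""
    (p1a, p1b), (p2a, p2b), (p3a, p3b) = P1, P2, P3
    A = [
        [-p3a, p3b, -p2a, p2b],
        [p3a, -p3b, p2a, -p2b],
        [-p3a, p3b, -p1a, p1b],
        [p3a, -p3b, p1a, -p1b],
    ]
    rhs = [
        2 * p1a - p2a - p3a,
        2 * p1b - p2b - p3b,
        2 * p2a - p1a - p3a,
        2 * p2b - p1b - p3b,
    ]
    return A, rhs


def objective(A, rhs, v):
    """0.5 * ||A v - rhs||^2."""
    return 0.5 * sum((sum(a * x for a, x in zip(row, v)) - r) ** 2
                     for row, r in zip(A, rhs))


def fit_conditionals(A, rhs, iterations=5000):
    """Projected gradient descent on the box 0 <= v_i <= 1 (stand-in for a QP solver)."""
    step = 1.0 / sum(a * a for row in A for a in row)   # 1 / ||A||_F^2
    v = [0.5] * 4
    for _ in range(iterations):
        residual = [sum(a * x for a, x in zip(row, v)) - r for row, r in zip(A, rhs)]
        gradient = [sum(A[i][k] * residual[i] for i in range(4)) for k in range(4)]
        v = [min(1.0, max(0.0, x - step * g)) for x, g in zip(v, gradient)]
    return v


# Hypothetical posteriors, e.g. classifier outputs on nodes 1 and 2 and a label on node 3.
P1, P2, P3 = [0.8, 0.2], [0.7, 0.3], [1.0, 0.0]
A, rhs = build_system(P1, P2, P3)
v = fit_conditionals(A, rhs)
P_l = [[1 - v[0], v[1]], [v[0], 1 - v[1]]]   # conditional on arcs from labeled nodes
P_u = [[1 - v[2], v[3]], [v[2], 1 - v[3]]]   # conditional on arcs between unlabeled nodes
print(v, objective(A, rhs, v))
```

Note that the second and fourth rows of (14.11) are the negatives of the first and third, since the posteriors sum to one, so a single small graph leaves v underdetermined; pooling equations over many arcs in the same equivalence class is what makes the fit informative.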
If sufficient labeled data are available, then a validation set can be used to determine
the optimal graph architecture (i.e., to which neighbors each point should connect).
However, often labeled data are scarce, and in fact semi-supervised learning is really
aimed at this case - that is, when labeled data are very scarce, but unlabeled data
are plentiful. Thus in general for SSL methods it is highly desirable to find a way
around having to use validation sets to choose either the model or its parameters.
In this chapter we will use model averaging: that is, for a given graph, given a
classifier, use CHM to lift its results; then do this for a variety of graphs, and simply
average the posteriors assigned by CHM to each node, across all graphs. This, in
combination with learning the conditionals, makes CHM a largely parameter-free
approach (once a general algorithm for constructing the graphs has been chosen),
although training using many graphs may be more computationally expensive than
using a validation set to choose one.
14.7 Experiments
We applied CHM to the problem of text categorization, and to five of the benchmark
classification tasks provided with this book.
Results The results are collected below. For the one-sided task, we plot F1 versus
training set size, for Rocchio, Rocchio plus CHM with unit matrices, and Rocchio
plus CHM for learned matrices in figure 14.3. It is interesting that, although on
this task using unit conditional probability matrices gives better mean results, the
learned matrices have lower variance: the results for learned matrices rarely drop
below the Rocchio baseline. Results for the two-sided task are collected in tables 14.1
through 14.7, where we show results for all classifiers and for all categories, as well
as the microaveraged results.
Table 14.1 F1 for top ten categories + microaverage F1, for training set size = 10
Figure 14.3 Results for the Rocchio classifiers (solid), Rocchio lifted with CHM,
unit conditional matrices (dashed), and Rocchio lifted with learned conditional matrices
(dotted). The y-axis is F1, the x-axis, training set size. Graphs are arranged left to right
in order of increasing category. Most graphs have the y-axis range chosen to be 0.35
for comparison. The last graph (bottom right) is the microaveraged results over all ten
categories.
Table 14.2 F1 for top ten categories + microaverage F1, for training set size = 20
Table 14.3 F1 for top ten categories + microaverage F1, for training set size = 50
Table 14.4 F1 for top ten categories + microaverage F1, for training set size = 100
Table 14.5 F1 for top ten categories + microaverage F1, for training set size = 200
Table 14.6 F1 for top ten categories + microaverage F1, for training set size = 500
Table 14.7 F1 for top ten categories + microaverage F1, for training set size = 1000
We also applied CHM to five of the benchmark classification data sets provided with
this book, namely data sets 1, 2, 4, 5, and 7. Each data set contains 1500 points,
with either 10 or 100 labeled points, and comes with a 12-fold validation split: all
results quoted here are microaveraged over the twelve splits. An SVM was used as
the base classifier in these experiments. Given the limited amount of training data,
we chose to use a linear SVM with a very high C parameter (C=1000), which was
effectively a hard-margin classifier.
where wi is the primal weight vector from the SVM on preprocessing alternative
i, and Ri is the radius of the smallest ball that contains the data (after the ith
preprocessing alternative). We approximate this radius by the distance between the
mean of the whole data set and the data point farthest from it. This selection
process usually picks one preprocessing method for all folds of a data set, often
choosing “norming.” However, some data sets (such as data set 4) alternate between
“norming” and “sphering.”
We investigated a different graph construction mechanism from that used for the
Reuters data. We call the algorithm “flood fill”. The flood fill method was found to
give similar results to the basic nearest-neighbor method, but resulted in smaller
graphs, leading to faster experiments (a typical run, for a given data set, for both
training set sizes, and for all 12 splits of the validation set, took approximately
50 minutes on a 3GHz machine). The Flood Fill method works as follows: choose
some fixed positive integer n. Add n directed arcs from each labeled node to its
nearest n unlabeled neighbors; all such arcs are assigned flavor = 1. Call the set
of nodes reached in this way N1 (where N1 does not include the training nodes).
For each node in N1 , do the same, allowing arcs to land on unlabeled nodes in
N1 ; assign all arcs generated in this way flavor = 2. At the ith iteration, arcs are
allowed to fall on unlabeled nodes in Ni , but not on nodes Nj , j < i. The process
repeats until either all nodes are reached, or until no further arcs can be added
(note that graphs with disconnected pieces are allowed here). Here we smoothed
(using model averaging) using values for n of 5, 9, 15, 25, and 50. The flood fill
algorithm can create disconnected subgraphs, and since it is not clear how best to
combine outputs of graphs with different connectedness, we simply thresholded the
value at each node after each smoothing step, before taking the average.
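The flood fill construction can be sketched as a level-by-level expansion. The description above leaves some details open, so the code below makes its interpretation explicit as an assumption: at each iteration the sources are the nodes first reached in the previous iteration, and an arc may land on any unlabeled node that was not reached in an iteration earlier than the source's own. Euclidean distance, the function name `flood_fill`, and the toy data are also assumptions.

```python
import math


def flood_fill(points, labeled, n):
    """Return directed arcs as (source, target, flavor) triples."""
    unlabeled = [i for i in range(len(points)) if i not in labeled]
    level = {}                  # unlabeled node -> iteration at which it was first reached
    arcs = []
    frontier = sorted(labeled)  # iteration 1 starts from the labeled nodes
    flavor = 1
    while frontier:
        newly_reached = []
        for source in frontier:
            # Assumed rule: targets may be unreached nodes or nodes reached no
            # earlier than the source itself (level >= flavor - 1).
            candidates = [t for t in unlabeled
                          if t != source and level.get(t, flavor) >= flavor - 1]
            candidates.sort(key=lambda t: math.dist(points[source], points[t]))
            for target in candidates[:n]:
                arcs.append((source, target, flavor))
                if target not in level:
                    level[target] = flavor
                    newly_reached.append(target)
        frontier = newly_reached   # stops when no new node was reached
        flavor += 1
    return arcs


# Hypothetical data: six points on a line, the first one labeled.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
arcs = flood_fill(points, labeled={0}, n=1)
print(arcs)
```

With n = 1 on this toy line the arcs simply march away from the labeled point, one flavor per step; larger n adds arcs within a level and produces more smoothing.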
Results We present the results in tables 14.8 through 14.11. We chose two
normalizations: the “normed/sphered/chopped” normalization, using the above
bound; or just using the “normed” normalization everywhere, combined with a soft-
margin linear SVM classifier (C=10). As in the Reuters experiments, the prior for
each data set is assumed known. The tables give accuracies on the unlabeled subsets
only. We applied a two-way ANOVA to assess the statistical significance of these
results, where the two factors are the fold number and the algorithm number, and
the prediction of the ANOVA is the correctness of a sample. For those experiments
where a main effect is found by ANOVA to be greater than a 99% significance level,
a post hoc test comparing all pairs of algorithms is performed (using the Tukey-
Kramer correction for repeated tests). Using a 99% (p < 0.01) significance level
for the post hoc comparisons, we find the results shown in the tables, where again
statistical significance is indicated with bold versus normal typeface; the results can
be summarized as follows:
• In no case is there a statistically significant difference between the learned
conditional matrices and the unit matrices for CHM.
• CHM beats the SVM for all conditions for data sets 1 and 2.
• For the case of data set 4, with normed-only preprocessing, and l = 100, SVM
beats CHM.
• There is no statistically significant difference between results for data set 5.
• For data set 7, SVM beats CHM for l = 10, and CHM beats SVM for l = 100.
Discussion This work demonstrates that CHM can be used to improve the
performance of the best available classifier, on several data sets, when labeled
data are limited. However, the improvement is not uniform; for some data sets we
observed that adding more smoothing (arcs) improved accuracy, while for others
increased smoothing caused accuracy to drop. A method to accurately predict the
required amount of smoothing for a given problem would boost the CHM accuracies
significantly. We attempted to overcome this behavior by model averaging, that is,
averaging over different graphs, but this is a crude way to address the problem. Also
in this chapter we only discussed two simple heuristics for constructing the graphs;
Table 14.8 Accuracy for labeled sets of size 10, using normed/sphered/chopped preprocessing
Table 14.9 Accuracy for labeled sets of size 100, using normed/sphered/chopped preprocessing
Table 14.10 Accuracy for labeled sets of size 10, using normed preprocessing only
Table 14.11 Accuracy for labeled sets of size 100, using normed preprocessing only
14.8 Conclusions
Acknowledgments
We are given a labeled data set of input-output pairs $(X_l, Y_l) = \{(x_1, y_1), \ldots, (x_l, y_l)\}$
and an unlabeled data set $X_u = \{x_{l+1}, \ldots, x_n\}$. We form a graph $g = (V, E)$ where
the vertices $V$ are $x_1, \ldots, x_n$, and the edges $E$ are represented by an $n \times n$ matrix
$W$. Entry $W_{ij}$ is the edge weight between nodes $i, j$, with $W_{ij} = 0$ if $i, j$
are not connected. The entries of $W$ have to be non-negative and symmetric,
but it is not necessary for $W$ itself to be positive semi-definite. Let $D$ be the
diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$ being the total weight on edges
connected to node $i$. The combinatorial graph Laplacian is defined as $L = D - W$,
which is also called the unnormalized Laplacian. The normalized graph Laplacian
is $\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$.
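The two Laplacians defined above can be computed directly from a weight matrix. The following is a minimal sketch, assuming NumPy and a small hypothetical path graph; the function name `graph_laplacians` is illustrative and not taken from the text.

```python
import numpy as np

def graph_laplacians(W):
    """Return (L, L_norm) for a symmetric, non-negative weight matrix W."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # degrees D_ii = sum_j W_ij
    L = np.diag(d) - W                     # combinatorial Laplacian D - W
    # Assumption: every node has positive degree, so D^{-1/2} exists.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    return L, L_norm

# Hypothetical 4-node path graph 1-2-3-4 with unit edge weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L, L_norm = graph_laplacians(W)
```

Each row of `L` sums to zero, and the eigenvalues of `L_norm` lie in the interval [0, 2].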
In graph-based semi-supervised learning the Laplacian $L$ (or $\mathcal{L}$) is a central
object. Let us denote the eigenvalues of $L$ by $\lambda_1 \le \ldots \le \lambda_n$, and the complete
orthonormal set of eigenvectors by $\phi_1, \ldots, \phi_n$. Therefore the spectral decomposition
of the Laplacian is given as $L = \sum_{i=1}^n \lambda_i \phi_i \phi_i^\top$. We refer readers to (Chung, 1997)
for a discussion of the mathematical aspects of this decomposition, but briefly
summarize two relevant properties:
$$\phi_i^\top L \phi_i = \lambda_i. \tag{15.2}$$
Thus, eigenvectors with smaller eigenvalues are smoother. Since $\{\phi_i\}$ forms a basis
on $\mathbb{R}^n$, we can always write any function $f$ as
$$f = \sum_{i=1}^n \alpha_i \phi_i, \quad \alpha_i \in \mathbb{R}. \tag{15.3}$$
Theorem 15.2 The graph g has k connected components if and only if λi = 0 for
i = 1, 2, . . . , k.
The corresponding eigenvectors φ1 , . . . , φk of L are constant on the nodes within
the corresponding connected component, and zero elsewhere. Note λ1 is always 0
for any graph (Chung, 1997). We will make use of this property later.
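These spectral properties are easy to check numerically. The sketch below assumes NumPy and a hypothetical five-node graph with two connected components; it verifies that each eigenvector's smoothness equals its eigenvalue, counts the zero eigenvalues, and expands an arbitrary function in the eigenbasis.

```python
import numpy as np

# Hypothetical graph: a triangle {0, 1, 2} and a separate edge {3, 4}.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam, Phi = np.linalg.eigh(L)    # ascending eigenvalues; columns of Phi are the phi_i

# Smoothness of each eigenvector: phi_i^T L phi_i = lambda_i (Eq. 15.2).
smoothness = np.einsum("ji,jk,ki->i", Phi, L, Phi)

# Number of (numerically) zero eigenvalues = number of connected components.
n_components = int(np.sum(lam < 1e-9))

# Any f expands as f = sum_i alpha_i phi_i (Eq. 15.3) with alpha = Phi^T f,
# and its smoothness f^T L f equals sum_i alpha_i^2 lambda_i.
f = np.array([1.0, 2.0, 0.5, -1.0, 3.0])
alpha = Phi.T @ f
```

Here `n_components` is 2, matching the two components of the example graph.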
Figure 15.1 A simple graph and its Laplacian spectral decomposition. Note the eigen-
vectors become rougher with larger eigenvalues.
Kernel methods are increasingly being used for classification because of their
conceptual simplicity, theoretical properties, and good performance on many tasks.
It is attractive to create kernels specifically for semi-supervised learning. We restrict
ourselves to transduction, i.e., the unlabeled data Xu are also the test data. As a
result we only need to consider kernel matrices $K \in \mathbb{R}^{n \times n}$ on nodes 1, . . . , n in the
graph.
In particular, we want K to respect the smoothness preferences encoded in a
graph. That is, as a regularizer the kernel should penalize functions that are not
smooth over the graph. To establish a link to the graph, we consider K having the
form
$$K = \sum_{i=1}^n \mu_i \phi_i \phi_i^\top, \tag{15.5}$$
where $\phi_i$ are the eigenvectors of the graph Laplacian $L$, and $\mu_i \ge 0$ are the
eigenvalues of $K$. Since $K$ is a non-negative sum of outer products, it is positive
semi-definite, i.e., a kernel matrix.
The matrix $K$ defines a reproducing kernel Hilbert space (RKHS) with norm
$$\|f\|_K^2 = \langle f, f \rangle_K = \sum_{i=1}^n \frac{\alpha_i^2}{\mu_i} \tag{15.6}$$
for a function $f = \sum_{i=1}^n \alpha_i \phi_i$. Note if some $\mu_i = 0$ the corresponding dimension is
not present in the RKHS, and we might define $1/0 = 0$ here.
In many learning algorithms, regularization is expressed as an increasing function
of $\|f\|_K$. From a semi-supervised learning point of view, we want $f$ to be penalized
if it is not smooth with respect to the graph. Comparing the smoothness of f in
Eq. 15.4 with Eq. 15.6, we find this can be achieved by making μi small if the
Laplacian eigenvalue λi is large, and vice versa.
Indeed, Chapelle et al. (2003) and Smola and Kondor (2003) both suggest a
general principle for creating a semi-supervised kernel K from the graph Laplacian.
Define a spectral transformation function $r : \mathbb{R}_+ \to \mathbb{R}_+$ that is non-negative and
decreasing. Set the kernel spectrum by $\mu_i = r(\lambda_i)$ to obtain the kernel
$$K = \sum_{i=1}^n r(\lambda_i) \phi_i \phi_i^\top. \tag{15.7}$$
Note that r essentially reverses the order of the eigenvalues, so that smooth φi ’s
have larger eigenvalues in K. Since r is decreasing, a greater penalty is incurred if
a function is not smooth.
The transform r is often chosen from a parametric family, resulting in some
familiar kernels. For example Chapelle et al. (2003) and Smola and Kondor (2003)
list the following transformations on L:
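regularized Laplacian: $r(\lambda) = \frac{1}{\lambda + \epsilon}$
diffusion kernel: $r(\lambda) = \exp\left(-\frac{\sigma^2}{2} \lambda\right)$
one-step random walk: $r(\lambda) = (\alpha - \lambda)$ with $\alpha \ge 2$
p-step random walk: $r(\lambda) = (\alpha - \lambda)^p$ with $\alpha \ge 2$
inverse cosine: $r(\lambda) = \cos(\lambda \pi / 4)$
step function: $r(\lambda) = 1$ if $\lambda \le \lambda_{\mathrm{cut}}$, and 0 otherwise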
Each has its own special interpretation. The regularized Laplacian is also known
as the Gaussian field kernel (Zhu et al., 2003c). Of course there are many other
natural choices for r. Although the general principle of Eq. 15.7 is appealing, it
does not address the question of which parametric family to use. Moreover, the
hyperparameters (e.g., $\sigma$ or $\epsilon$ above) in a particular parametric family may not suit
the task at hand, resulting in overly constrained kernels.
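As an illustration of Eq. 15.7, the following sketch builds two of the listed kernels from the eigendecomposition of a normalized Laplacian. It assumes NumPy; the helper name `spectral_kernel`, the cycle graph, and the hyperparameter values are hypothetical choices made only for this example.

```python
import numpy as np

def spectral_kernel(L, r):
    """K = sum_i r(lambda_i) phi_i phi_i^T for a symmetric Laplacian L (Eq. 15.7)."""
    lam, Phi = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)          # remove tiny negative round-off
    mu = r(lam)
    return (Phi * mu) @ Phi.T, lam, mu     # Phi * mu scales column i by mu_i

# Illustrative hyperparameter values (assumptions, not taken from the text).
eps, sigma = 0.1, 1.0
regularized_laplacian = lambda lam: 1.0 / (lam + eps)
diffusion = lambda lam: np.exp(-0.5 * sigma**2 * lam)

# Hypothetical 4-node cycle graph, using the normalized Laplacian.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L_norm = np.eye(4) - W / np.sqrt(np.outer(d, d))

K_reg, lam, mu_reg = spectral_kernel(L_norm, regularized_laplacian)
K_dif, _, mu_dif = spectral_kernel(L_norm, diffusion)
```

For the regularized Laplacian the result coincides with the matrix inverse of `L_norm + eps * I`, which gives a convenient sanity check on the construction.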
Is there an optimal spectral transformation? The following sections address
this question. The short answer is yes, in a certain sense. We select a spectral
transformation that optimizes kernel alignment to the labeled data, while imposing
an ordering constraint but otherwise not assuming any parametric form. Kernel
alignment is a surrogate for classification accuracy, and, importantly, leads to a
convex optimization problem.
The empirical kernel alignment (Cristianini et al., 2002a; Lanckriet et al., 2004a)
assesses the fitness of a kernel to training labels. The alignment has a number
of convenient properties: it can be efficiently computed before any training of
the kernel machine takes place, and based only on training data information.
The empirical alignment can also be shown to be sharply concentrated around
its expected value, allowing it to be estimated from finite samples. A connection
between high alignment and good generalization performance has been established
in (Cristianini et al., 2002a).
As we will compare matrices, we introduce here the Frobenius product $\langle \cdot, \cdot \rangle_F$
between two square matrices $M$ and $N$ of the same size:
$$\langle M, N \rangle_F = \sum_{ij} m_{ij} n_{ij} = \mathrm{Tr}(M^\top N),$$
which equals $\mathrm{Tr}(MN)$ for the symmetric matrices considered here.
The empirical kernel alignment compares the $l \times l$ kernel matrix $K_{tr}$ on the labeled
training set $x_1, \ldots, x_l$, and a target matrix $T$ derived from the labels $y_1, \ldots, y_l$. One
such target matrix is $T_{ij} = 1$ if $y_i = y_j$, and $-1$ otherwise. Note for binary $\{+1, -1\}$
training labels $Y_l = (y_1 \ldots y_l)^\top$ this is simply the rank one matrix $T = Y_l Y_l^\top$. The
empirical kernel alignment is defined as follows.
Definition 15.3 (empirical kernel alignment) Let $K_{tr}$ be the kernel matrix
restricted to the training points, and $T$ the target matrix on training data. We
define the empirical kernel alignment as
$$\hat{A}(K_{tr}, T) = \frac{\langle K_{tr}, T \rangle_F}{\sqrt{\langle K_{tr}, K_{tr} \rangle_F \, \langle T, T \rangle_F}}. \tag{15.8}$$
The empirical alignment is essentially the cosine between the matrices $K_{tr}$ and
$T$. The range of the alignment is [0, 1]. The larger its value, the closer the kernel
is to the target. This quantity is maximized when $K_{tr} \propto T$.
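A short numerical sketch of the alignment in Eq. 15.8 follows. It assumes NumPy and a hypothetical set of four binary labels; the function names are illustrative.

```python
import numpy as np

def frobenius(M, N):
    """Frobenius product <M, N>_F = sum_ij m_ij n_ij."""
    return float(np.sum(M * N))

def empirical_alignment(K_tr, T):
    """Empirical kernel alignment of Eq. 15.8."""
    return frobenius(K_tr, T) / np.sqrt(frobenius(K_tr, K_tr) * frobenius(T, T))

# Hypothetical binary labels for l = 4 training points.
y = np.array([1.0, 1.0, -1.0, -1.0])
T = np.outer(y, y)                       # T_ij = 1 if y_i = y_j, else -1

K_ideal = 3.0 * T                        # proportional to the target
K_identity = np.eye(4)                   # carries no label information

a_ideal = empirical_alignment(K_ideal, T)        # equals 1 up to round-off
a_identity = empirical_alignment(K_identity, T)  # l / (sqrt(l) * l) = 0.5 for l = 4
```

A kernel proportional to the target attains the maximal alignment of 1, while the identity kernel, which treats every point as similar only to itself, scores 0.5 on this example.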
Having introduced the alignment quantity, now let us consider the problem of semi-
supervised kernel construction using a principled nonparametric approach. In short,
we will learn the spectral transformation {μi ≡ r(λi )} (15.7) by optimizing the
resulting kernel alignment, with certain restrictions. Notice we no longer assume
a parametric function r(); instead we work with the transformed eigenvalues μi ’s
directly.
When the kernel matrix is defined as $K = \sum_{i=1}^n \mu_i \phi_i \phi_i^\top$ and the target $T$ is
given, the kernel alignment between the labeled submatrix $K_{tr}$ and $T$ is a convex
function in the $\mu_i$'s. Nonetheless, in general we have to make sure $K$ is a valid kernel
matrix, i.e., it is positive semi-definite. This is a semi-definite program (SDP),
which has high computational complexity (Boyd and Vandenberghe, 2004). We thus
restrict $\mu_i \ge 0, \forall i$. This guarantees $K$ to be positive semi-definite, and reduces the
optimization problem into a quadratically constrained quadratic program (QCQP),
which is computationally more efficient. In a QCQP both the objective function
and the constraints are quadratic, as illustrated below:
$$\text{minimize} \quad \tfrac{1}{2} x^\top P_0 x + q_0^\top x + r_0 \tag{15.9}$$
$$\text{subject to} \quad \tfrac{1}{2} x^\top P_i x + q_i^\top x + r_i \le 0, \quad i = 1, \ldots, m \tag{15.10}$$
$$Ax = b, \tag{15.11}$$
where $P_i \in S_+^n$, $i = 0, \ldots, m$, and $S_+^n$ denotes the set of square symmetric positive
semi-definite matrices. In a QCQP, we minimize a convex quadratic function over
a feasible region that is the intersection of ellipsoids. The number of iterations
required to reach the solution is comparable to the number required for linear
programs, making the approach feasible for large data sets.
Previous work using kernel alignment did not take into account that the “building
blocks” Ki = φi φi ⊤ were derived from the graph Laplacian with the goal of semi-
supervised learning. As such, the μi ’s can take arbitrary non-negative values and
there is no preference to penalize components that do not vary smoothly over the
graph. This shall be rectified by requiring smoother eigenvectors to receive larger
coefficients, as shown in the next section.
15.5 Semi-Supervised Kernels with Order Constraints
$$\mu_i \ge \mu_{i+1}, \quad i = 1, \ldots, n-1. \tag{15.12}$$
where the trace constraint is replaced by (15.19) (up to a constant factor). Let
$\mathrm{vec}(A)$ be the column vectorization of a matrix $A$. Defining
$$M = \left[ \mathrm{vec}(K_{1,tr}) \cdots \mathrm{vec}(K_{m,tr}) \right] \tag{15.23}$$
The objective function is linear in μ, and there is a simple cone constraint, making
it a QCQP.
We can further improve the kernel. Consider a graph that has a single connected
component, i.e., any node can reach any other node via one or more edges. Such
graphs are common in practice. By the basic property of the Laplacian we know
$\lambda_1 = 0$, and the corresponding eigenvector $\phi_1$ is a constant. Therefore $K_1 = \phi_1 \phi_1^\top$ is
a constant matrix. Such a constant matrix acts as a bias term in the graph kernel,
as in (15.7). We should not constrain $\mu_1$ as in definition 15.4, but allow the bias of
the kernel to vary freely. This motivates the following definition:
also, although the optimization problem is not necessarily convex. These kernels
use different information from the original Laplacian eigenvalues $\lambda_i$. The
maximal-alignment kernels ignore $\lambda_i$ altogether. The order-constrained semi-supervised
kernels only use the order of $\lambda_i$ and ignore their actual values. The diffusion and
Gaussian field kernels use the actual values. In terms of the degrees of freedom in
choosing the spectral transformation $\mu_i$'s, the maximal-alignment kernels are
completely free. The diffusion and Gaussian field kernels are restrictive since they have
an implicit parametric form and only one free parameter. The order-constrained
semi-supervised kernels incorporate desirable features from both approaches.
We evaluate kernels on seven data sets. The data sets and the corresponding graphs
are summarized in table 15.1. baseball-hockey, pc-mac and religion-atheism are
binary document categorization tasks taken from the 20-newsgroups data set. The
distance measure is the cosine similarity between tf.idf vectors. one-two, odd-even,
and ten digits are handwritten digits recognition tasks originally from the Cedar
Buffalo binary digits database. one-two is digits “1” versus “2”; odd-even is the
artificial task of classifying odd “1, 3, 5, 7, 9” versus even “0, 2, 4, 6, 8” digits,
such that each class has several well-defined internal clusters; ten digits is 10-way
classification; isolet is isolated spoken English alphabet recognition from the UCI
repository. For these data sets we use Euclidean distance on raw features. We use
10-nearest-neighbor (10NN) unweighted graphs on all data sets except isolet which
is 100NN. For all data sets, we use the smallest m = 200 eigenvalue and eigenvector
pairs from the graph Laplacian. These values are set arbitrarily without optimizing
and do not create an unfair advantage to the order-constrained kernels. For each
data set we test on five different labeled set sizes. For a given labeled set size, we
perform 30 random trials in which a labeled set is randomly sampled from the
whole data set. All classes must be present in the labeled set. The rest is used as
an unlabeled (test) set in that trial.
We compare a total of eight different types of kernels. Five are semi-supervised
kernels: improved order-constrained kernels, order-constrained kernels, Gaussian
field kernels (section 15.2), diffusion kernels (section 15.2), and maximal-alignment
kernels (section 15.5). Three are standard supervised kernels, which do not use
unlabeled data in kernel construction: linear kernels, quadratic kernels, and radial
basis function (RBF) kernels.
We compute the spectral transformation for improved order-constrained kernels,
order-constrained kernels, and maximal-alignment kernels by solving the QCQP
using the standard solver SeDuMi/YALMIP (see (Sturm, 1999) and (Löfberg,
2004)). The hyperparameters in the Gaussian field kernels and diffusion kernels
are learned with the fminbnd() function in Matlab to maximize kernel alignment.
The bandwidth of the RBF kernels is learned using fivefold cross-validation on
labeled set accuracy. Here and below, cross-validation is done independently
of, and after, the kernel alignment methods, to optimize a quantity not related to the
proposed kernels.
We apply the eight kernels to the same support vector machine (SVM) in order
to compute the accuracy on unlabeled data. For each task and kernel combination,
we choose the bound on SVM slack variables C with fivefold cross-validation on
labeled set accuracy. For multiclass classification we perform one-against-all and
pick the class with the largest margin.
Tables 15.2 through 15.8 list the results. There are two rows for each cell:
the upper row is the average test (unlabeled) set accuracy with one standard
deviation; the lower row is the average training (labeled) set kernel alignment,
and in parentheses the average run time in seconds for QCQP on a 2.4GHz Linux
computer. Each number is averaged over 30 random trials. To assess the statistical
significance of the results, we perform a paired t-test on test accuracy. We highlight
the best accuracy in each row, and those that cannot be distinguished from the
best with a paired t-test at significance level 0.05.
We find that:
The five semi-supervised kernels tend to outperform the three standard supervised
kernels. This shows that with properly constructed graphs, unlabeled data can help
classification.
The order-constrained kernel is often quite good, but the improved order-
constrained kernel is even better. All the graphs on these data sets happen to
be connected. Recall that this is the case in which the improved order-constrained
kernel differs from the order-constrained kernel by not constraining the bias term.
Evidently a flexible bias term is important for classification accuracy.
Figure 15.2 shows the spectral transformation μi of the five semi-supervised
kernels for different tasks. These are the average of the 30 trials with the largest
labeled set size in each task. The x-axis is in increasing order of λi (the original
eigenvalues of the Laplacian). The mean (thick lines) and ±1 standard deviation
(dotted lines) of only the top 50 μi ’s are plotted for clarity. The μi values are scaled
vertically for easy comparison among kernels. As expected the maximal-alignment
kernels’ spectral transformation is zigzagged, diffusion’s and Gaussian field’s are
very smooth, while (improved) order-constrained kernels are in between.
The order-constrained kernels (green) have large $\mu_1$ because of the order
constraint on the constant eigenvector. Again this seems to be disadvantageous: the
spectral transformation tries to balance it out by increasing the value of the other $\mu_i$'s,
so that the bias term $K_1$'s relative influence is smaller. On the other hand, the
improved order-constrained kernels (black) allow $\mu_1$ to be small. As a result the
remaining $\mu_i$'s decay fast, which is desirable.
In summary, the improved order-constrained kernel is consistently the best among
all kernels.
15.7 Conclusion
[Figure 15.2 plot panels: each panel shows scaled μ (vertical axis) against rank (horizontal axis, 0 to 50).]
Figure 15.2 Comparison of spectral transformation for the five semi-supervised kernels.
Acknowledgment
We thank Olivier Chapelle and the anonymous reviewers for their comments and
suggestions.
16 Spectral Methods for Dimensionality
Reduction
16.1 Introduction
where the vectors $\{e_\alpha\}_{\alpha=1}^m$ define a partial orthonormal basis of the input space.
From (16.1), one can easily show that the subspace with minimum reconstruction
error is also the subspace with maximum variance. The basis vectors of this subspace
are given by the top $m$ eigenvectors of the $d \times d$ covariance matrix,
$$C = \frac{1}{n} \sum_i x_i x_i^\top, \tag{16.2}$$
assuming that the input patterns xi are centered on the origin. The outputs of
PCA are simply the coordinates of the input patterns in this subspace, using
the directions specified by these eigenvectors as the principal axes. Identifying
eα as the αth top eigenvector of the covariance matrix, the output ψi ∈ Rm
for the input pattern xi ∈ Rd has elements ψiα = xi · eα . The eigenvalues of
the covariance matrix in Eq. 16.2 measure the projected variance of the high-
dimensional data set along the principal axes. Thus, the number of significant
eigenvalues measures the dimensionality of the subspace that contains most of the
data’s variance, and a prominent gap in the eigenvalue spectrum indicates that
the data are mainly confined to a lower-dimensional subspace. Figure 16.1 shows
the results of PCA applied to a toy data set in which the inputs lie within a
thin slab of three-dimensional space. Here, a simple linear projection reveals the
data’s low-dimensional (essentially planar) structure. More details on PCA can be
found in (Jolliffe, 1986). We shall see in section 16.3.2 that the idea of reducing
dimensionality by maximizing variance is also useful for nonlinear dimensionality
reduction.
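The PCA procedure just described can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `pca` and the synthetic thin-slab data are hypothetical and only loosely mimic the toy data set mentioned above.

```python
import numpy as np

def pca(X, m):
    """Project rows of X (n x d) onto the top-m eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the inputs on the origin
    C = Xc.T @ Xc / Xc.shape[0]             # d x d covariance matrix, Eq. 16.2
    evals, evecs = np.linalg.eigh(C)        # eigh returns ascending order
    order = np.argsort(evals)[::-1]         # re-sort in descending order
    evals, evecs = evals[order], evecs[:, order]
    Psi = Xc @ evecs[:, :m]                 # psi_{i,alpha} = x_i . e_alpha
    return Psi, evals

# Hypothetical toy data: a thin slab in three dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 0.05])
Psi, evals = pca(X, m=2)
```

The third eigenvalue is far smaller than the first two, which is the kind of gap in the spectrum that signals an essentially planar data set.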
The minimum error solution is obtained from the spectral decomposition of the
Gram matrix of inner products,
$$G_{ij} = x_i \cdot x_j. \tag{16.4}$$
Denoting the top $m$ eigenvectors of this Gram matrix by $\{v_\alpha\}_{\alpha=1}^m$ and their
respective eigenvalues by $\{\lambda_\alpha\}_{\alpha=1}^m$, the outputs of MDS are given by $\psi_{i\alpha} = \sqrt{\lambda_\alpha}\, v_{\alpha i}$.
Though MDS is designed to preserve inner products, it is often motivated by
the idea of preserving pairwise distances. Let $S_{ij} = \|x_i - x_j\|^2$ denote the matrix
of squared pairwise distances between input patterns. Often the input to MDS
is specified in this form. Assuming that the inputs are centered on the origin,
a Gram matrix consistent with these squared distances can be derived from the
transformation $G = -\frac{1}{2}(I - uu^\top) S (I - uu^\top)$, where $I$ is the $n \times n$ identity matrix
and $u = \frac{1}{\sqrt{n}}(1, 1, \ldots, 1)^\top$ is the uniform vector of unit length. More details on MDS
can be found in (Cox and Cox, 1994).
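The transformation from squared distances to a Gram matrix, followed by the eigendecomposition, can be sketched as follows. The example assumes NumPy; the function name `classical_mds` and the random planar inputs are hypothetical.

```python
import numpy as np

def classical_mds(S, m):
    """Metric MDS from an n x n matrix S of squared pairwise distances."""
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # I - u u^T with u = (1, ..., 1) / sqrt(n)
    G = -0.5 * H @ S @ H                     # Gram matrix consistent with S
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:m]      # indices of the top-m eigenpairs
    lam = np.clip(evals[order], 0.0, None)   # guard against round-off
    return evecs[:, order] * np.sqrt(lam)    # psi_{i,alpha} = sqrt(lambda_alpha) v_{alpha,i}

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                 # hypothetical planar inputs
S = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Psi = classical_mds(S, m=2)
S_out = ((Psi[:, None, :] - Psi[None, :, :]) ** 2).sum(axis=-1)
```

Because the inputs here are already two-dimensional, the outputs reproduce all pairwise squared distances up to round-off, and they differ from the inputs only by a rigid motion.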
Though based on a somewhat different geometric intuition, metric MDS yields
the same outputs ψi ∈ Rm as PCA—essentially a rotation of the inputs followed
by a projection into the subspace with the highest variance. (The outputs of both
algorithms are invariant to global rotations of the input patterns.) The Gram matrix
of metric MDS has the same rank and eigenvalues up to a constant factor as the
16.3 Graph-Based Methods
Linear methods such as PCA and metric MDS generate faithful low-dimensional
representations when the high-dimensional input patterns are mainly confined to
a low-dimensional subspace. If the input patterns are distributed more or less
throughout this subspace, the eigenvalue spectra from these methods also reveal the
data set’s intrinsic dimensionality—that is to say, the number of underlying modes
of variability. A more interesting case arises, however, when the input patterns
lie on or near a low-dimensional submanifold of the input space. In this case, the
structure of the data set may be highly nonlinear, and linear methods are bound
to fail.
Graph-based methods have recently emerged as a powerful tool for analyzing
high-dimensional data that have been sampled from a low-dimensional submanifold.
These methods begin by constructing a sparse graph in which the nodes represent
input patterns and the edges represent neighborhood relations. The resulting graph
(assuming, for simplicity, that it is connected) can be viewed as a discretized ap-
proximation of the submanifold sampled by the input patterns. From these graphs,
one can then construct matrices whose spectral decompositions reveal the low-
dimensional structure of the submanifold (and sometimes even the dimensional-
ity itself). Though capable of revealing highly nonlinear structure, graph-based
methods for manifold learning are based on highly tractable (i.e., polynomial-time)
optimizations such as shortest-path problems, least-squares fits, semidefinite pro-
gramming, and matrix diagonalization. In what follows, we review four broadly
representative graph-based algorithms for manifold learning: Isomap (Tenenbaum
et al., 2000), maximum variance unfolding (Weinberger and Saul, 2004; Sun et al.,
2006), locally linear embedding (Roweis and Saul, 2000; Saul and Roweis, 2003),
and Laplacian eigenmaps (Belkin and Niyogi, 2003a).
16.3.1 Isomap
distances along the submanifold are substituted for standard Euclidean distances.
Figure 16.2 illustrates the difference between these two types of distances for input
patterns sampled from a Swiss roll.
The algorithm has three steps. The first step is to compute the k-nearest
neighbors of each input pattern and to construct a graph whose vertices represent
input patterns and whose (undirected) edges connect k-nearest neighbors. The
edges are then assigned weights based on the Euclidean distance between nearest
neighbors. The second step is to compute the pairwise distances $\Delta_{ij}$ between
all nodes $(i, j)$ along shortest paths through the graph. This can be done using
Dijkstra's algorithm, which scales as $O(n^2 \log n + n^2 k)$. Finally, in the third step,
the pairwise distances $\Delta_{ij}$ from Dijkstra's algorithm are fed as input to MDS,
as described in section 16.2.2, yielding low-dimensional outputs $\psi_i \in \mathbb{R}^m$ for
which $\|\psi_i - \psi_j\|^2 \approx \Delta_{ij}^2$. The value of $m$ required for a faithful low-dimensional
representation can be estimated by the number of significant eigenvalues in the
Gram matrix constructed by MDS.
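The three steps can be sketched compactly as below. This is an illustrative implementation under stated assumptions, not a reference one: it uses NumPy and the standard-library `heapq`, computes neighbors by brute force, assumes the neighborhood graph is connected, and uses hypothetical function names and a hypothetical spiral data set.

```python
import heapq
import numpy as np

def knn_graph(X, k):
    """Undirected k-nearest-neighbor graph as adjacency dicts with Euclidean weights."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    adj = [dict() for _ in range(n)]
    for i in range(n):
        for j in np.argsort(dist[i])[1:k + 1]:      # skip the point itself
            adj[i][int(j)] = adj[int(j)][i] = float(dist[i, j])
    return adj

def dijkstra(adj, source):
    """Shortest-path distances from source to every node."""
    dist = [np.inf] * len(adj)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                                # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def isomap(X, k, m):
    n = len(X)
    adj = knn_graph(X, k)                                        # step 1
    Delta = np.array([dijkstra(adj, s) for s in range(n)])       # step 2
    # Assumption: the graph is connected, so every entry of Delta is finite.
    H = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * H @ (Delta ** 2) @ H                              # step 3: MDS
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:m]
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None)), Delta

# Hypothetical data: points along a spiral, a one-dimensional curve in the plane.
t = np.linspace(0.0, 3.0 * np.pi, 60)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
Psi, Delta = isomap(X, k=2, m=1)
```

On this example the one-dimensional output orders the points along the spiral, and the estimated geodesic distance between the two endpoints is much larger than their straight-line distance.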
When it succeeds, Isomap yields a low-dimensional representation in which the
Euclidean distances between outputs match the geodesic distances between input
patterns on the submanifold from which they were sampled. Moreover, there are
formal guarantees of convergence (Tenenbaum et al., 2000; Donoho and Grimes,
2002) when the input patterns are sampled from a submanifold that is isometric
to a convex subset of Euclidean space—that is, if the data set has no “holes.” This
condition will be discussed further in section 16.5.
Figure 16.2 (Left) Comparison of Euclidean and geodesic distance between two input
patterns A and B sampled from a Swiss roll. Euclidean distance is measured along the
straight line in input space from A to B; geodesic distance is estimated by the shortest
path (in bold) that only directly connects k = 12 nearest neighbors. (Right) The low-
dimensional representation computed by Isomap for n = 1024 inputs sampled from a
Swiss roll. The Euclidean distances between outputs match the geodesic distances between
inputs.
Maximum variance unfolding (Weinberger and Saul, 2004; Sun et al., 2006) is
based on computing the low-dimensional representation of a high-dimensional data
set that most faithfully preserves the distances and angles between nearby input
patterns. Like Isomap, it appeals to the notion of isometry and constructs a Gram
matrix whose top eigenvectors yield a low-dimensional representation of the data
set; unlike Isomap, however, it does not involve the estimation of geodesic distances.
Instead, the algorithm attempts to “unfold” a data set by pulling the input patterns
apart as far as possible subject to distance constraints that ensure that the final
transformation from input patterns to outputs looks locally like a rotation plus
translation. To picture such a transformation from d = 3 to m = 2 dimensions, one
can imagine a flag being unfurled by pulling on its four corners (but not so hard as
to introduce any tears).
The first step of the algorithm is to compute the k-nearest neighbors of each
input pattern. A neighborhood-indicator matrix is defined as ηij = 1 if and only if
the input patterns xi and xj are k-nearest neighbors or if there exists another input
pattern of which both are k-nearest neighbors; otherwise ηij = 0. The constraints
to preserve distances and angles between k-nearest neighbors can be written as
$$\|\psi_i - \psi_j\|^2 = \|x_i - x_j\|^2 \tag{16.5}$$
for all (i, j) such that ηij = 1. To eliminate a translational degree of freedom in the
low-dimensional representation, the outputs are also constrained to be centered on
the origin:
$$\sum_i \psi_i = 0 \in \mathbb{R}^m. \tag{16.6}$$
Finally, the algorithm attempts to “unfold” the input patterns by maximizing the
variance of the outputs,
$$\mathrm{var}(\psi) = \sum_i \|\psi_i\|^2, \tag{16.7}$$
while preserving local distances and angles, as in (16.5). Figure 16.3 illustrates the
connection between maximizing variance and reducing dimensionality.
The above optimization can be reformulated as an instance of semi-definite
programming (Vandenberghe and Boyd, 1996). A semi-definite program is a linear
program with the additional constraint that a matrix whose elements are linear
in the optimization variables must be positive semi-definite. Let $K_{ij} = \psi_i \cdot \psi_j$
denote the Gram matrix of the outputs. The constraints in Eqs. 16.5–16.7 can be
written entirely in terms of the elements of this matrix. Maximizing the variance
of the outputs subject to these constraints turns out to be a useful surrogate for
minimizing the rank of the Gram matrix (which is computationally less tractable).
The Gram matrix $K$ of the “unfolded” input patterns is obtained by solving the
semi-definite program:
Figure 16.3 Input patterns sampled from a Swiss roll are “unfolded” by maximizing
their variance subject to constraints that preserve local distances and angles. The middle
snapshots show various feasible (but nonoptimal) intermediate solutions of the optimiza-
tion described in section 16.3.2.
The first constraint indicates that the matrix K is required to be positive semi-
definite. As in MDS and Isomap, the outputs are derived from the eigenvalues and
eigenvectors of this Gram matrix, and the dimensionality of the underlying subman-
ifold (i.e., the value of m) is suggested by the number of significant eigenvalues.
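The claim that Eqs. 16.5 through 16.7 can be written in terms of the Gram matrix of the outputs can be checked numerically. The sketch below assumes NumPy and random hypothetical outputs; the identities it uses (squared distances as $K_{ii} - 2K_{ij} + K_{jj}$, centering as a zero sum of all entries of $K$, variance as the trace of $K$) are the standard ones and are stated here as an illustration rather than quoted from the text. It does not solve the semi-definite program itself.

```python
import numpy as np

rng = np.random.default_rng(2)
Psi = rng.normal(size=(6, 2))           # hypothetical outputs psi_i in R^2
Psi -= Psi.mean(axis=0)                 # enforce the centering constraint, Eq. 16.6
K = Psi @ Psi.T                         # Gram matrix K_ij = psi_i . psi_j

# Eq. 16.5: squared distances are linear in the elements of K.
diag = np.diag(K)
sq_dist_from_K = diag[:, None] - 2.0 * K + diag[None, :]
sq_dist_direct = ((Psi[:, None, :] - Psi[None, :, :]) ** 2).sum(axis=-1)

# Eq. 16.6: centered outputs give sum_ij K_ij = 0.
total = K.sum()

# Eq. 16.7: the variance of the outputs is the trace of K.
variance = (Psi ** 2).sum()
```

All three quantities are linear in the entries of `K`, which is what allows the problem to be posed as a semi-definite program over `K`.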
Figure 16.4 Intuition behind LLE. (Left) n = 2000 input patterns sampled from a Swiss
roll. (Middle) Results of minimizing (16.9) with k = 20 nearest neighbors and ℓ = 25,
ℓ = 15, and ℓ = 10 randomly chosen outputs (indicated by black landmarks) clamped
to the locations of their corresponding inputs. (Right) Two-dimensional representation
obtained by minimizing Eq. 16.9 with no outputs clamped to inputs, but subject to the
centering and orthogonality constraints of LLE.
Suppose we are given a real-valued function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with the property that
there exists a map $\Phi : \mathbb{R}^d \to \mathcal{H}$ into a dot product “feature” space $\mathcal{H}$ such that for
all $x, x' \in \mathbb{R}^d$, we have $\Phi(x) \cdot \Phi(x') = k(x, x')$. The kernel function $k(x, x')$ can be
viewed as a nonlinear similarity measure. Examples of kernel functions that satisfy
the above criteria include the polynomial kernels $k(x, x') = (1 + x \cdot x')^p$ for positive
integers $p$ and the Gaussian kernels $k(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$. Many linear
methods in statistical learning can be generalized to nonlinear settings by employing
the so-called kernel trick — namely, substituting these generalized dot products in
feature space for Euclidean dot products in the space of input patterns (Schölkopf
and Smola, 2002). In section 16.4.1, we review the nonlinear generalization of
PCA (Schölkopf et al., 1998) obtained in this way, and in section 16.4.2, we
discuss the relation between kernel PCA and the manifold learning algorithms of
section 16.3. Our treatment closely follows that of Ham et al. (2004).
To find the top eigenvectors of $C$, we can exploit the duality of PCA and MDS
mentioned earlier in section 16.2.2. Observe that all solutions to $Ce = \nu e$ with
$\nu \ne 0$ must lie in the span of $(\Phi(x_1), \ldots, \Phi(x_n))$. Expanding the $\alpha$th eigenvector
as $e_\alpha = \sum_i v_{\alpha i} \Phi(x_i)$ and substituting this expansion into the eigenvalue equation,
we obtain a dual eigenvalue problem for the coefficients $v_{\alpha i}$, given by $K v_\alpha = \lambda_\alpha v_\alpha$,
where $\lambda_\alpha = n \nu_\alpha$ and $K_{ij} = k(x_i, x_j)$ is the so-called kernel matrix, that is, the
Gram matrix in feature space. We can thus interpret kernel PCA as a nonlinear
version of MDS that results from substituting generalized dot products in feature
space for Euclidean dot products in input space (Williams, 2001). Following the
prescription for MDS in section 16.2.2, we compute the top $m$ eigenvalues and
eigenvectors of the kernel matrix. The low-dimensional outputs $\psi_i \in \mathbb{R}^m$ of kernel
PCA (or equivalently, kernel MDS) are then given by $\psi_{i\alpha} = \sqrt{\lambda_\alpha}\, v_{\alpha i}$.
One modification to the above procedure often arises in practice. In (16.11), we
have assumed that the feature vectors in $\mathcal{H}$ have zero mean. In general, we cannot
assume this, and therefore we need to subtract the mean $(1/n) \sum_i \Phi(x_i)$ from each
feature vector before computing the covariance matrix in (16.11). This leads to a
Figure 16.5 Results of kernel PCA with Gaussian and polynomial kernels applied to
n = 1024 input patterns sampled from a Swiss roll. These kernels do not lead to low-
dimensional representations that unfold the Swiss roll.
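Kernel PCA as described in this section can be sketched as follows, assuming NumPy. The function names, the random data, and the kernel width are hypothetical. Centering is done with the usual double-centering of the kernel matrix, $(I - uu^\top) K (I - uu^\top)$, which subtracts the feature-space mean implicitly; this specific formula is a standard choice assumed for the example.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma**2)

def kernel_pca(K, m, center=True):
    """Outputs psi_{i,alpha} = sqrt(lambda_alpha) v_{alpha,i} from a kernel matrix K."""
    n = K.shape[0]
    if center:
        # Standard double centering: removes the feature-space mean implicitly.
        H = np.eye(n) - np.ones((n, n)) / n
        K = H @ K @ H
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:m]          # top-m eigenpairs
    lam = np.clip(evals[order], 0.0, None)
    return evecs[:, order] * np.sqrt(lam), lam

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))                     # hypothetical inputs
Psi, lam = kernel_pca(gaussian_kernel_matrix(X, sigma=2.0), m=2)

# Sanity check with a linear kernel: kernel PCA then reduces to ordinary PCA / MDS,
# and the kernel eigenvalues are n times the covariance eigenvalues.
Xc = X - X.mean(axis=0)
Psi_lin, lam_lin = kernel_pca(X @ X.T, m=2)
cov_evals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(X)))[::-1][:2]
```

The linear-kernel check reflects the relation $\lambda_\alpha = n \nu_\alpha$ between the eigenvalues of the kernel matrix and those of the covariance matrix.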
All of the algorithms in section 16.3 can be viewed as instances of kernel PCA,
with kernel matrices that are derived from sparse weighted graphs rather than a
predefined kernel function (Ham et al., 2004). Often these kernels are described as
“data-dependent” kernels, because they are derived from graphs that encode the
neighborhood relations of the input patterns in the training set. These kernel ma-
trices may also be useful for other tasks in machine learning besides dimensionality
reduction, such as classification and nonlinear regression (Belkin et al., 2004b).
In this section, we discuss how to interpret the matrices of graph-based spectral
methods as kernel matrices.
The Isomap algorithm in section 16.3.1 computes a low-dimensional embedding
by computing shortest paths through a graph and processing the resulting distances
by MDS. The Gram matrix constructed by MDS from these geodesic distances can
be viewed as a kernel matrix. For finite data sets, however, this matrix is not
guaranteed to be positive semi-definite. It should therefore be projected onto the
cone of positive semi-definite matrices before it is used as a kernel matrix in other
settings.
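One common way to carry out such a projection is to set the negative eigenvalues of the symmetric matrix to zero, which yields the nearest positive semi-definite matrix in Frobenius norm. The sketch below assumes NumPy; the function name and the small indefinite example matrix are hypothetical.

```python
import numpy as np

def project_to_psd(G):
    """Nearest (Frobenius-norm) PSD matrix to a symmetric G: clip negative eigenvalues."""
    G = 0.5 * (G + G.T)                              # symmetrize against round-off
    evals, evecs = np.linalg.eigh(G)
    return (evecs * np.clip(evals, 0.0, None)) @ evecs.T

G = np.array([[2.0, 0.0],
              [0.0, -1.0]])                          # hypothetical indefinite matrix
G_psd = project_to_psd(G)
```

For this diagonal example the negative entry is simply replaced by zero, leaving the positive direction unchanged.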
16.5 Discussion
Each of the spectral methods for nonlinear dimensionality reduction has its own
advantages and disadvantages. Some of the differences between the algorithms
have been studied in formal theoretical frameworks, while others have simply
emerged over time from empirical studies. We conclude by briefly contrasting the
statistical, geometrical, and computational properties of different spectral methods
and describing how these differences often play out in practice.
Most theoretical work has focused on the behavior of these methods in the limit
n → ∞ of large sample size. In this limit, if the input patterns are sampled from a
submanifold of Rd that is isometric to a convex subset of Euclidean space—that is,
if the data set contains no “holes”—then the Isomap algorithm from section 16.3.1
will recover this subset up to a rigid motion (Tenenbaum et al., 2000). Many image
manifolds generated by translations, rotations, and articulations can be shown to fit
into this framework (Donoho and Grimes, 2002). A variant of LLE known as Hessian
LLE has also been developed with even broader guarantees (Donoho and Grimes,
2003). Hessian LLE asymptotically recovers the low-dimensional parameterization
(up to rigid motion) of any high-dimensional data set whose underlying submanifold
is isometric to an open, connected subset of Euclidean space; unlike Isomap, the
subset is not required to be convex.
The asymptotic convergence of maximum variance unfolding has not been studied
in a formal setting. Unlike Isomap, however, the solutions from maximum variance
unfolding in section 16.3.2 are guaranteed to preserve distances between nearest
neighbors for any finite set of n input patterns. Maximum variance unfolding also
behaves differently than Isomap on data sets whose underlying submanifold is
isometric to a connected but not convex subset of Euclidean space. Figure 16.6
contrasts the behavior of Isomap and maximum variance unfolding on two data
sets with this property.
Of the algorithms described in section 16.3, LLE and Laplacian eigenmaps scale
best to moderately large data sets (n < 10,000), provided that one uses special-purpose
eigensolvers that are optimized for sparse matrices. The internal iterations
of these eigensolvers rely mainly on matrix-vector multiplications, which can be done
in O(n) time when the matrices are sparse. The computation time in Isomap tends to be dominated by the calculation
of shortest paths. The most computationally intensive algorithm is maximum vari-
ance unfolding, due to the expense of solving semi-definite programs (Vandenberghe
and Boyd, 1996) over n × n matrices.
For significantly larger data sets, all of the above algorithms present serious chal-
lenges: the bottom eigenvalues of LLE and Laplacian eigenmaps can be tightly
spaced, making it difficult to resolve the bottom eigenvectors, and the computa-
tional bottlenecks of Isomap and maximum variance unfolding tend to be pro-
hibitive. Accelerated versions of Isomap and maximum variance unfolding have
been developed by first embedding a small subset of “landmark” input patterns,
then using various approximations to derive the rest of the embedding from the
landmarks. The landmark version of Isomap (de Silva and Tenenbaum, 2003) is
based on the Nyström approximation and scales very well to large data sets (Platt,
2004); millions of input patterns can be processed in minutes on a PC (though
the algorithm makes the same assumption as Isomap that the data set contains no
“holes”). The landmark version of maximum variance unfolding (Weinberger et al.,
2005) is based on a factorized approximation of the Gram matrix, derived from local
linear reconstructions of the input patterns (as in LLE). It solves a much smaller
SDP than the original algorithm and can handle larger data sets (currently, up to
n = 20,000), though it is still much slower than the landmark version of Isomap.
Note that all the algorithms rely as a first step on computing nearest neighbors,
which naively scales as $O(n^2)$, but faster algorithms are possible based on specialized
data structures (Friedman et al., 1977; Gray and Moore, 2001; Beygelzimer
et al., 2004).

Figure 16.6 Results of Isomap and maximum variance unfolding on two data sets whose
underlying submanifolds are not isometric to convex subsets of Euclidean space. (Left) 1617
input patterns sampled from a trefoil knot. (Right) n = 400 images of a teapot rotated
through 360 degrees. The embeddings are shown, as well as the eigenvalues of the Gram
matrices, normalized by their trace. The algorithms estimate the dimensionality of the
underlying submanifold by the number of appreciable eigenvalues. Isomap is foiled in this
case by nonconvexity.

Research on spectral methods for dimensionality reduction continues at a rapid
pace. Other algorithms closely related to the ones covered here include Hessian
LLE (Donoho and Grimes, 2003), c-Isomap (de Silva and Tenenbaum, 2003), local
tangent space alignment (Zhang and Zha, 2004), geodesic null-space analysis
(Brand, 2004), and conformal eigenmaps (Sha and Saul, 2005). Motivation for
ongoing work includes the handling of manifolds with more complex geometries, the
need for robustness to noise and outliers, and the ability to scale to large data sets.
In this chapter, we have focused on nonlinear dimensionality reduction, a problem
in unsupervised learning. Graph-based spectral methods also play an important role
in semi-supervised learning. For example, the eigenvectors of the normalized graph
Laplacian provide an orthonormal basis—ordered by smoothness—for all functions
(including decision boundaries and regressions) defined over the neighborhood
graph of input patterns; see chapter 12 by Sindhwani, Belkin, and Niyogi. Likewise,
as discussed in chapter 15 by Zhu and co-workers, the kernel matrices learned
by unsupervised algorithms can be transformed by discriminative training for the
purpose of semi-supervised learning. Finally, in chapter 17, Sajama and Orlitsky
show how shortest-path calculations and multidimensional scaling can be used to
derive more appropriate feature spaces in a semi-supervised setting. In all these
ways, graph-based spectral methods are emerging to address the very broad class
of problems that lie between the extremes of purely supervised and unsupervised
learning.
17 Modifying Distances
Sajama
Alon Orlitsky
Learning algorithms use a notion of similarity between data points to make infer-
ences. Semi-supervised algorithms assume that two points are similar to each other
if they are connected by a high-density region of the unlabeled data. Apart from
semi-supervised learning, such density-based distance metrics also have applications
in clustering and nonlinear interpolation. In this chapter, we discuss density-based
metrics induced by Riemannian manifold structures. We present asymptotically
consistent methods to estimate and compute these metrics and present upper and
lower bounds on their estimation and computation errors. Finally, we discuss how
these metrics can be used for semi-supervised learning and present experimental
results.
17.1 Introduction
When data are in Rd , the standard similarity measure used by learning algorithms
is the Euclidean distance. Semi-supervised learning algorithms rely on the intuition
that two data points are similar to each other if they are connected by a high-density
region. For example, based on this intuition, in the case of the two-dimensional data
sample shown in figure 17.1, point 2 is closer to point 3 than to point 1. In this
chapter we consider measuring this density-based notion of similarity directly in the
form of a distance metric between all pairs of points and then using this resulting
metric in standard learning algorithms to perform semi-supervised classification.
To see how a density-based distance (DBD) metric can be defined, let us take a
closer look at the two-strips example in figure 17.2. Since there is a path between
points 2 and 3 that lies in a high density region (for example, P3), we assume them
to be similar or “closer.” Conversely, since none of the paths between points 1 and
2 (P1, P2, etc.) can avoid the low-density regions, they are ‘farther’ according to
the density-based notion of distance.
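The two-strips intuition can be illustrated with a small sketch, not the chapter's construction: shortest paths are computed on a neighborhood graph whose edge lengths are the Euclidean lengths scaled by a weight that decreases with the density at the edge midpoint. The density function, the weighting function q, the neighborhood radius eps, and the five sample points are all hypothetical choices made for this illustration.

```python
import heapq
import math

def density(x):
    # Hypothetical two-strips density: high near y = 0 and y = 1, low elsewhere.
    return 1.0 if min(abs(x[1]), abs(x[1] - 1.0)) < 0.2 else 0.01

def q(y):
    # Hypothetical weighting: decreasing in the density, bounded below by 0.1.
    return 0.1 + 0.9 * math.exp(-5.0 * y)

def graph_distances(points, eps, source):
    """Dijkstra on the eps-neighborhood graph with density-weighted edge lengths."""
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale queue entry
        for j in range(n):
            length = math.dist(points[i], points[j])
            if j == i or length > eps:
                continue  # only points within eps of each other are connected
            mid = [(a + b) / 2.0 for a, b in zip(points[i], points[j])]
            new_d = d + length * q(density(mid))
            if new_d < dist[j]:
                dist[j] = new_d
                heapq.heappush(heap, (new_d, j))
    return dist

# Index 0 plays the role of point 1 (upper strip); indices 1 and 2 play the
# roles of points 2 and 3 (lower strip), with two more sample points between them.
points = [(0.0, 1.0), (0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
from_point_2 = graph_distances(points, eps=1.0, source=1)
```

In this toy sample, point 2 is Euclidean-closer to point 1 than to point 3, yet the density-weighted graph distance from point 2 to point 3 is the smaller one, because the path to point 3 stays inside a high-density strip.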
Figure 17.1 According to the semi-supervised smoothness assumption, point 2 has greater similarity (is closer) to point 3 than to point 1.

Figure 17.2 This notion of similarity can be written in terms of a property of paths between the points.

where $|\cdot|_2$ is the $L_2$ norm on $\mathbb{R}^d$. We can assume, without loss of generality, that all
paths are parameterized to have unit speed according to the standard Euclidean
metric on $\mathbb{R}^d$, and hence that $L_E(\gamma)$ is the Euclidean length of the curve γ and $|\gamma'(t)|_2 = 1$.
The DBD between two points x′ and x′′ is defined to be
points. Computation error arises since this minimization cannot be done perfectly
when computational resources are limited.
This computation problem has been extensively studied (Sethian, 1999) and
finds applications in computational geometry, fluid mechanics, computer vision,
and materials science. These methods involve building a grid in Rd whose size is
exponential in d. This is inconvenient for the learning scenario where the data
dimension is usually high. It is therefore necessary to consider grids based on data
points, in which case the computational complexity grows at a rate polynomial in
sample size n. Heuristics for computing the minimum Riemannian distance using
graphs constructed on data samples have been suggested by Vincent and Bengio
(2003); Bousquet et al. (2004), and Sajama and Orlitsky (2005).
In the sections that follow, we present asymptotically consistent methods to
estimate and compute these metrics and show bounds on the estimation and
computation errors of these metrics (Sajama and Orlitsky, 2005). We also discuss
the various ways in which density-based metrics could be used for semi-supervised
learning and present experimental results.
17.2 Estimating DBD Metrics
In this section we consider the error in our knowledge of DBD metrics that comes
from the fact that we have a limited data sample, i.e., a set of d-dimensional data
points $\{x_1, \dots, x_n\}$ drawn i.i.d. from a probability density function p(x). In other
words, we are interested in the estimation of the path length function
\[
\Gamma(\gamma; p) \doteq \int_{t=0}^{L_E(\gamma)} q(p(\gamma(t)))\, |\gamma'(t)|_2 \, dt
\]
(see section 17.1) for any given path γ. Note that for a fixed path γ, Γ(γ; p) is
a functional of the density p(x). Several different ways of analyzing estimators
of functionals of data density have been studied in the statistics literature. For
bounding the error in estimating the DBD metric we borrow from the proof
techniques used by Stone (1980), and Goldstein and Messer (1992).
To characterize the estimators of the path lengths and hence the DBD metric, we
use the definitions of upper and lower bounds on rate of convergence of estimators
proposed by Stone (1980). Let W denote the set to which p is known to belong.
\[
\lim_{c \to \infty} \limsup_{n} \sup_{p \in W} P_p\left( |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p)| > c\, n^{-r} \right) = 0.
\]
Definition 17.2 A rate r > 0 is an upper bound to the rate of convergence if, for
every sequence of estimators $\hat\Gamma_n$,
\[
\liminf_{n} \sup_{p \in W} P_p\left( |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p)| > c\, n^{-r} \right) > 0 \quad \forall c > 0 \tag{17.3}
\]
and
\[
\lim_{c \to 0} \liminf_{n} \sup_{p \in W} P_p\left( |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p)| > c\, n^{-r} \right) = 1. \tag{17.4}
\]
17.2.1 Achievability
We are trying to understand the limits on rate at which the estimation error can
converge to zero as sample size n increases. Lower bounds on the achievable rate of
convergence can be shown by considering particular estimators and analyzing their
performance. This is the basic idea which leads to the first theorem in this section
where we consider the plug-in estimator, $\hat\Gamma_n$, for the path length Γ, i.e.,
\[
\hat\Gamma_n(\gamma) \doteq \Gamma(\gamma; \hat p_n).
\]
This estimator is obtained by plugging in the kernel density estimator $\hat p_n$ for data
density in place of actual density p(x) into the expression for path length Γ. The
kernel density estimator is given by
\[
\hat p_n(x) = \frac{1}{n h_n^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h_n} \right).
\]
For proving these bounds, the function q that controls the path length is assumed
to have the following properties:
[A1] q is a monotonically decreasing function
[A2] $\inf_y q(y) > 0$
[A3] q has bounded first and second derivatives
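To make the role of q concrete, the following sketch approximates the path length Γ(γ; p) along a straight segment with the midpoint rule. The particular q (one of many functions satisfying A1 to A3 on nonnegative densities), the helper names, and the number of integration steps are assumptions made for illustration; the density argument can be the true p or a density estimate, which gives the plug-in estimator.

```python
import math

def q(y):
    # Hypothetical weighting satisfying A1-A3 for y >= 0:
    # decreasing, bounded below by 0.1, with bounded derivatives.
    return 0.1 + 0.9 * math.exp(-y)

def path_length(start, end, density, steps=1000):
    """Midpoint-rule approximation of Gamma(gamma; p) along the segment start -> end."""
    euclid = math.dist(start, end)  # Euclidean length L_E of the segment
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        x = [a + t * (b - a) for a, b in zip(start, end)]
        total += q(density(x))
    # Unit-speed parameterization: integrate q(p(gamma(t))) over [0, L_E].
    return euclid * total / steps

# With a constant density, the path length is the Euclidean length times q.
length = path_length((0.0, 0.0), (3.0, 4.0), lambda x: 2.0)
```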
One feature of the kernel density estimator is that, when the true data density
can be assumed to be smooth (have a certain number of derivatives), its bias can
be reduced by choosing an appropriate kernel. Let us denote by $W_s$ the set of
functions which have s or more continuous derivatives. We assume that p(x) has
the following properties:
1. $p(x) \in W_s$
2. p(x) has bounded support
3. $\exists\, C_1$ such that $\|\nabla p(x)\| \le C_1 \ \forall x$
The smoothness parameter s measures the complexity of the class of underlying
distributions. Given that p(x) belongs to $W_s$, we base the density estimate on the
d-dimensional kernel $K(x) = \prod_{j=1}^{d} k(x_j)$. Here k is a one-dimensional kernel with
the following properties:
\[
k(x) = k(-x), \qquad \int k(x)\,dx = 1, \qquad \sup_{-\infty < x < \infty} |k(x)| \le A < \infty,
\]
\[
\int x^m k(x)\,dx = 0, \quad m = 1, \dots, s-1, \qquad \text{and} \qquad 0 \ne \int x^s k(x)\,dx < \infty.
\]
We use the following two lemmas about well-known properties of kernel density
estimators (cf. Nadaraya, 1989).
where
\[
F(u, x) = \sum_{|\mu| = s} \frac{u^{\mu}}{\mu!} \int_{T=0}^{1} (1 - T)^{s-1} D^{\mu} p(x + T u)\, dT.
\]
Theorem 17.5 (achievability) Uniformly over all pairs of points x′ and x′′ in
the support of p(x), the plug-in estimator $\hat d_n(x', x'')$ that uses the kernel density
estimator $\hat p_n$ achieves the rate of convergence $r = \frac{s}{2s+d}$ when the width of the
kernel density estimator is $h_n = c / n^{1/(2s+d)}$, where c is a constant.
Proof We begin by defining the derivative T of the functional Γ(γ; p) with respect
to changes δp(x) in p(x) to be
\[
T(\delta p; p) \doteq \int_{t=0}^{L_E(\gamma)} q'(p(\gamma(t)))\, \delta p(\gamma(t))\, |\gamma'(t)|_2 \, dt.
\]
where p and $\hat p_n$ are evaluated at γ(t). By an argument similar to the proof of the mean value
theorem, we know that
\[
q(y + \delta y) - q(y) - \delta y\, q'(y) = \frac{q''(\beta)}{2!}\, \delta y^2
\]
for some β in the domain of q. Hence, for some constant C,
\[
|\Gamma(\gamma; \hat p_n) - \Gamma(\gamma; p) - T(\hat p_n - p; p)| \le C \int_{t=0}^{L_E(\gamma)} \{\hat p_n(\gamma(t)) - p(\gamma(t))\}^2\, |\gamma'(t)|_2 \, dt.
\]
Therefore,
\begin{align*}
|\Gamma(\gamma; \hat p_n) - \Gamma(\gamma; p)| \le{} & |T(\hat p_n - E_p[\hat p_n]; p)| + |T(E_p[\hat p_n] - p; p)| \\
& + \left| C \int_{t=0}^{L_E(\gamma)} \{\hat p_n(\gamma(t)) - p(\gamma(t))\}^2\, |\gamma'(t)|_2 \, dt \right|.
\end{align*}
We now bound each of these three terms in turn. The variance of the first term
is bounded as follows:
\begin{align*}
E_p\left[ \left( \int_t q'(p(\gamma(t)))\, \{\hat p_n - E_p[\hat p_n]\}\, |\gamma'(t)|_2 \, dt \right)^2 \right]
&\le L \left( \max_{\beta} q'(\beta) \right)^2 E_p\left[ \int_t \{\hat p_n - E_p[\hat p_n]\}^2\, |\gamma'(t)|_2 \, dt \right] \\
&= L \left( \max_{\beta} q'(\beta) \right)^2 \int_t E_p\left[ (\hat p_n - E_p[\hat p_n])^2 \right] |\gamma'(t)|_2 \, dt \\
&\le \frac{(1 + \epsilon_1) L^2}{n h_n^d} \left( \max_{\beta} q'(\beta) \right)^2 \max_x p(x) \int_{\mathbb{R}^d} K^2(u)\, du.
\end{align*}
The first inequality follows from the Cauchy-Schwarz inequality, and the
equality follows from Fubini’s theorem. The last inequality is true for sufficiently
large n by lemma 17.4. The constant L is the maximum Euclidean length of the
paths that we are considering and hence also upper-bounds the length of these
paths according to the density-based Riemannian metric. Since the variance of
$T(\hat p_n - E_p[\hat p_n]; p)$ is bounded as above for sufficiently large n, we can conclude that
\[
T(\hat p_n - E_p[\hat p_n]; p) = O\left( \frac{1}{(n h_n^d)^{1/2}} \right).
\]
The second term $T(E_p[\hat p_n] - p; p)$ can be bounded in terms of the partial
derivatives of p(x):
\begin{align*}
T(E_p[\hat p_n] - p; p) &= \int_t q'(p(\gamma(t)))\, (E_p[\hat p_n] - p)\, |\gamma'(t)|_2 \, dt \\
&\le \left( \max_{\beta} q'(\beta) \right) h_n^s \int_t \left[ \int_u \left\{ \sum_{|\mu| = s} \frac{u^{\mu}}{\mu!} \{ D^{\mu} p(\gamma(t)) + \epsilon_2 \} \right\} K(u)\, du \right] |\gamma'(t)|_2 \, dt \\
&= O(h_n^s).
\end{align*}
Here, we have used lemma 17.3 and the inequality follows from uniform continuity
of $D^{\mu} p$ and holds for sufficiently large n.
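The third term, $\frac{1}{2} (\max_{\beta} |q''(\beta)|) \int_t \{\hat p_n(\gamma(t)) - p(\gamma(t))\}^2 |\gamma'(t)|_2 \, dt$, can be bounded
by bounding the expectation of $\int_t \{\hat p_n(\gamma(t)) - p(\gamma(t))\}^2 |\gamma'(t)|_2 \, dt$ and then using
Markov’s inequality.
\begin{align*}
E_p\left[ \int_t \{\hat p_n(\gamma(t)) - p(\gamma(t))\}^2\, |\gamma'(t)|_2 \, dt \right]
&= \int_t E_p\left[ (\hat p_n - p)^2 \right] |\gamma'(t)|_2 \, dt \\
&= \int_t (E_p[\hat p_n] - p)^2\, |\gamma'(t)|_2 \, dt + \int_{t=0}^{L_E(\gamma)} E_p\left[ (\hat p_n - E_p[\hat p_n])^2 \right] |\gamma'(t)|_2 \, dt,
\end{align*}
\[
\int_t (E_p[\hat p_n] - p)^2\, |\gamma'(t)|_2 \, dt = O(h_n^{2s}),
\]
\[
\int_t E_p\left[ (\hat p_n - E_p[\hat p_n])^2 \right] |\gamma'(t)|_2 \, dt = O\left( \frac{1}{n h_n^d} \right).
\]
Collecting the three terms and assuming that $h_n = c / n^{1/(2s+d)}$, we conclude
\[
|\Gamma(\gamma; \hat p_n) - \Gamma(\gamma; p)| = O\left( \frac{1}{(n h_n^d)^{1/2}} + h_n^s + \frac{1}{n h_n^d} + h_n^{2s} \right) = O\left( \frac{1}{n^{s/(2s+d)}} \right).
\]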
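The balance between the four terms can be checked with exact arithmetic on the exponents of n. The sketch below assumes only the bandwidth choice $h_n = c\,n^{-1/(2s+d)}$ stated above and ignores constants; the function names are hypothetical.

```python
from fractions import Fraction

def error_exponents(s, d):
    """Exponents of n in the four error terms when h_n = c * n**(-1/(2s+d))."""
    h_exp = Fraction(-1, 2 * s + d)   # h_n scales as n**h_exp
    nhd = 1 + d * h_exp               # n * h_n**d scales as n**nhd
    return {
        "(n h^d)^(-1/2)": -nhd / 2,
        "h^s": s * h_exp,
        "1/(n h^d)": -nhd,
        "h^(2s)": 2 * s * h_exp,
    }

def rate(s, d):
    # The slowest-decaying term (largest exponent) determines the overall rate r.
    return -max(error_exponents(s, d).values())
```

For example, rate(2, 3) evaluates to 2/7, which is s/(2s + d); the first two terms decay at exactly this rate and the last two decay twice as fast.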
Theorem 17.6 (upper bound) No estimator of the DBD metric can converge at
a rate faster than r = 1/2.
Proof To prove this result, we show that there is a density function p(x) and
a shortest path between two points γ for which Γ̂(γ) cannot converge to Γ(γ; p)
faster than the rate r, irrespective of which estimator is used to obtain Γ̂(γ). The
technique, termed “the classification argument,” was used by Stone (1980).
Consider a density function p0 (x) with the property that the set {x : p0 (x) > α}
contains an open ball in Rd over which p0 (x) is constant. Let γ be any line segment
contained in this open ball, let xm be any point in the relative interior of γ, and
let x0 be any point in the ball which does not lie on the path γ. Since p0 (x) is
constant over the ball, any line segment including γ is the shortest path between its
two endpoints. Let ψ be a non-negative, infinitely differentiable C ∞ function with
compact support (for an example called “the blimp” see Strichartz (1995)). Define
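\[
w_n(x) \doteq \delta N n^{-1/2} \{ \psi(x - x_m) - b_n \psi(x - x_0) \}.
\]
Here, $b_n$ is chosen such that $\int w_n p_0 \, dx = 0$. We define a sequence of densities
$p_n \doteq p_0 (1 + w_n)$.
\[
\frac{\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)}{2} \ge C \delta N n^{-1/2}, \tag{17.6}
\]
where C is some positive constant.
\[
n E_{p_0}\left[ w_n^2(X) \right] = \frac{n \delta^2 N^2}{n} \int p_0(x) \{ \psi(y) - b_n \psi(y + (x_p - x_0)) \}^2 \, dx < \infty \tag{17.5}
\]
For sufficiently large n, we have
$\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0) = T(p_n - p_0; p_0) + O\left( \int_{t=0}^{L_E(\gamma)} (p_n - p_0)^2 |\gamma'(t)|_2 \, dt \right) \ge (\delta N n^{-1/2})$,
since $\int_{t=0}^{L_E(\gamma)} (p_n - p_0)^2 |\gamma'(t)|_2 \, dt = O(n^{-1})$.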
Now, we show that using Eqs. 17.5 and 17.6, we can prove the two conditions
(Eqs. 17.3 and 17.4) needed to show that 1/2 is an upper bound on the rate of
convergence. Note that this part of the proof follows closely the proof in (Stone,
1980) and we are restating it here in detail for completeness. Let μn and νn
denote the joint distribution of the i.i.d. random variables X1 , . . . , Xn under density
functions p0 and pn respectively. Let Ln denote the Radon-Nikodym derivative
dνn /dμn and set ln = loge Ln .
\[
l_n = \sum_{i=1}^{n} \log(1 + w_n(X_i))
\]
Using the Taylor expansion $\log(1 + z) = z - \frac{z^2}{2} + \frac{z^3}{3} - \cdots$ and the fact that
$|w_n(x)| \le 0.5$ for n sufficiently large,
\[
\left| l_n - \sum_{i=1}^{n} w_n(X_i) \right| \le \sum_{i=1}^{n} w_n^2(X_i)
\;\Rightarrow\;
|l_n| \le \left| \sum_{i=1}^{n} w_n(X_i) \right| + \sum_{i=1}^{n} w_n^2(X_i).
\]
By Schwarz’s inequality
\[
E_{p_0}\left| \sum_{i=1}^{n} w_n(X_i) \right| \le \left( E_{p_0}\left[ \left( \sum_{i=1}^{n} w_n(X_i) \right)^2 \right] \right)^{1/2} = \left( n E_{p_0}\left[ w_n^2(X) \right] \right)^{1/2}.
\]
Hence $E_{p_0}[|l_n|] \le \left( n E_{p_0}[w_n^2(X)] \right)^{1/2} + n E_{p_0}[w_n^2(X)]$. This combined with
Eq. 17.5 yields
\[
\limsup_{n} E_{p_0}[|l_n|] < \infty \quad \text{and} \quad \lim_{\delta \to 0} \limsup_{n} E_{p_0}[|l_n|] = 0. \tag{17.7}
\]
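Hence, there is a finite, positive M such that $\limsup_n E_{p_0}[|\log L_n|] < M$. Choose
ε > 0 such that if $L_n > (1 - \epsilon)/\epsilon$ or $L_n < \epsilon/(1 - \epsilon)$, then $|\log L_n| \ge 2M$. By the
Markov inequality,
\[
\liminf_{n} \mu_n\left( \frac{\epsilon}{1 - \epsilon} \le L_n \le \frac{1 - \epsilon}{\epsilon} \right) > \frac{1}{2}.
\]
Let n be sufficiently large so that
\[
\mu_n\left( \frac{\epsilon}{1 - \epsilon} \le L_n \le \frac{1 - \epsilon}{\epsilon} \right) > \frac{1}{2}.
\]
Put prior probabilities 1/2 each on $p_0$ and $p_n$. Then
\[
P\{ p = p_n \mid X_1, \dots, X_n \} = \frac{L_n / 2}{L_n / 2 + 1/2} = \frac{L_n}{L_n + 1}
\]
and hence
\begin{align*}
P\{ \epsilon \le P\{ p = p_n \mid X_1, \dots, X_n \} \le 1 - \epsilon \}
&= P\left\{ \epsilon \le \frac{L_n}{L_n + 1} \le 1 - \epsilon \right\}
 = P\left\{ \frac{\epsilon}{1 - \epsilon} \le L_n \le \frac{1 - \epsilon}{\epsilon} \right\} \\
&\ge \frac{1}{2}\, \mu_n\left( \frac{\epsilon}{1 - \epsilon} \le L_n \le \frac{1 - \epsilon}{\epsilon} \right) \ge \frac{1}{4}.
\end{align*}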
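The equivalence between the band on the posterior probability and the band on the likelihood ratio $L_n$ used in the last display can be checked numerically. This is an illustrative sketch; the function names and the sample values of L and eps are arbitrary.

```python
def posterior_pn(L):
    """Posterior probability of p_n given likelihood ratio L and equal priors."""
    return (L / 2.0) / (L / 2.0 + 0.5)

def in_band(L, eps):
    # The band on L that is equivalent to eps <= posterior <= 1 - eps.
    return eps / (1.0 - eps) <= L <= (1.0 - eps) / eps

eps = 0.1
checks = [
    (eps <= posterior_pn(L) <= 1.0 - eps) == in_band(L, eps)
    for L in (0.05, 0.2, 1.0, 5.0, 20.0)
]
```

Every entry of checks is true: the posterior lies in [ε, 1 − ε] exactly when the likelihood ratio lies in [ε/(1 − ε), (1 − ε)/ε].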
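Therefore any method of deciding between $p_0$ and $p_n$ based on $X_1, \dots, X_n$ must
have overall error probability at least ε/4. Apply this result to the classifier $\bar p_n$
defined by
\[
\bar p_n =
\begin{cases}
p_0 & \text{if } \hat\Gamma_n(\gamma) \le \dfrac{\Gamma(\gamma; p_0) + \Gamma(\gamma; p_n)}{2}, \\
p_n & \text{otherwise.}
\end{cases}
\]
It follows that
\begin{align*}
& \frac{1}{2} P_{p_0}\left\{ |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p_0)| \ge \frac{\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)}{2} \right\} \\
& \quad + \frac{1}{2} P_{p_n}\left\{ |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p_n)| \ge \frac{\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)}{2} \right\} \ge \frac{\epsilon}{4},
\end{align*}
consequently,
\[
\sup_{p \in W_s} P_p\left\{ |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p)| \ge \frac{\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)}{2} \right\} \ge \frac{\epsilon}{4},
\]
and hence
\[
\liminf_{n} \sup_{p \in W_s} P_p\left\{ |\hat\Gamma_n(\gamma) - \Gamma(\gamma; p)| \ge \frac{\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)}{2} \right\} > 0.
\]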
This along with Eq. 17.6 proves the first requirement (Eq. 17.3) for 1/2 to be an
upper bound on the rate of convergence.
To prove the second part of the upper-bound definition, we choose a positive
integer $i_o \ge 2$ and put prior probability $i_o^{-1}$ on each of the $i_o$ points:
\[
p_{ni} = p_0 + \frac{i - 1}{i_o - 1} (p_n - p_0).
\]
Now, ∃δ > 0 such that for sufficiently large n, any method of classifying
$p \in \{p_{n1}, \dots, p_{n i_o}\}$ based on $X_1, \dots, X_n$ must have overall probability of error at least $1 - 2/i_o$.
This is because
\[
P\{ p = p_{ni} \mid X_1, \dots, X_n \} = \frac{1 + \frac{i - 1}{i_o - 1} (L_n - 1)}{\frac{L_n + 1}{2}\, i_o}
\]
and the optimum classifier to choose between the $p_{ni}$ is
\[
\bar p_n =
\begin{cases}
p_0 & \text{if } L_n < 1, \\
p_n & \text{otherwise,}
\end{cases}
\]
which produces an error whenever one of $p_{n2}, \dots, p_{n(i_o - 1)}$ is chosen in the random
draw among the $p_{ni}$.
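Note that
\[
(p_{ni} + p_{n(i+1)}) - (p_{n(i-1)} + p_{n(i)}) = p_0 + \frac{1}{2(i_o - 1)} (p_n - p_0).
\]
So, considering the classifier
\[
\hat p = p_{ni} \quad \text{if} \quad |\hat\Gamma_n - \Gamma(\gamma; p_{ni})| \le \frac{|\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)|}{2(i_o - 1)},
\]
we get
\[
\frac{1}{i_o} \sum_{i} P_{p_{ni}}\left\{ |\hat\Gamma_n - \Gamma(\gamma; p_{ni})| \ge \frac{|\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)|}{2(i_o - 1)} \right\} \ge 1 - \frac{2}{i_o}.
\]
Consequently,
\[
\sup_{p} P_p\left\{ |\hat\Gamma_n - \Gamma(\gamma; p)| \ge \frac{|\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)|}{2(i_o - 1)} \right\} \ge 1 - \frac{2}{i_o}
\]
and hence
\[
\lim_{i_o \to \infty} \liminf_{n} \sup_{p \in W_s} P_p\left\{ |\hat\Gamma_n - \Gamma(\gamma; p)| \ge \frac{|\Gamma(\gamma; p_n) - \Gamma(\gamma; p_0)|}{2(i_o - 1)} \right\} = 1.
\]
This along with Eq. 17.6 proves the second requirement (Eq. 17.4) for 1/2 to be
an upper bound on the rate of convergence.
17.3 Computing DBD Metrics
In section 17.2, we analyzed the effect of using an estimate of the density function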
in place of the density function itself. However, even if the density were known,
computing the Riemannian metric between two points is not an easy task. This is
a variational minimization problem since the distance is defined as the infimum of
path lengths over all paths joining the points (Eq. 17.1). Isomap (Tenenbaum et al.,
2000; Bernstein et al., 2000) uses paths along a neighborhood graph to approximate
paths along a manifold embedded in Rd . Vincent and Bengio (2003); Bousquet et al.
(2004), and Chapelle and Zien (2005) propose graph-based methods to compute
density-based metrics for use in semi-supervised learning. However, these heuristics
for approximating DBD metrics are not guaranteed to lead to a consistent distance
measure, i.e., they do not guarantee convergence of the graph shortest-path length
to the Riemannian metric with increasing sample size. In this section we present
upper and lower bounds on the rate at which approximation error can converge to
zero when a particular graph construction is used for computing the Riemannian
metric.
17.3.1 Achievability
We show that the rate 1/(2d) is achievable, i.e., we present a graph construction
method which produces graphs such that with high probability the difference
between the shortest distance along the graph and the DBD metric is smaller than
$c/n^{1/(2d)}$, for some constant c and for large enough n. In the proof, we use some
techniques from Tenenbaum et al. (2000) and Bernstein et al. (2000).
We first describe the method for constructing the graph and assigning weights to
the graph edges. In addition to the three assumptions made about the weighting
function q in section 17.2, we assume that
[A4] q(y) = 1 ∀ y ≤ α.
Note that this is not overly restrictive since we can choose α to be small. As
discussed in section 17.1, it is necessary to assume that q(y) does not change rapidly
for small y in order to have uniform bounds on approximation errors when using
graph-based lengths to approximate path lengths. Let $C_p(\alpha) \doteq \{x : \hat p(x) \ge \alpha\}$ and
let $C_p(\alpha; \epsilon) \doteq \bigcup_{x \in C_p(\alpha)} B(x, \epsilon)$, where $B(x, \epsilon)$ is a d-dimensional ball of radius ε
centered at x.
where P = (x0 , . . . , xm ) varies over all paths along the edges of g connecting x = x0
to y = xm .
To lower-bound the rate of convergence of the shortest path along graph g to the
DBD metric, we bound the difference between the graph distance and DBD metric
in theorem 17.10. For this purpose we show the DBD metric and the intermediate
distance are close to each other in lemma 17.7. Lemmas 17.8 and 17.9 state that
the graph and intermediate distances are close.
Since each outside segment has a minimum length ǫ − 2δ, dM (x, y) ≥ (ǫ − 2δ)k.
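where
\[
\lambda_1 = 2\, \frac{\max_x |\nabla_x q(p(x))|_2 \, \epsilon}{\min_x q(p(x))}.
\]
Proof Let $\epsilon_2 = d_M(x_i, x_j)/2$
and let $B(\mathrm{line}(x_i, x_j), \epsilon_2) = \bigcup_{x \in \mathrm{line}(x_i, x_j)} B(x, \epsilon_2)$.
Now,
\[
R_{\min} |x_i - x_j|_2 \le d_M(x_i, x_j) \le R_{\max} |x_i - x_j|_2
\quad \text{and} \quad
d_g(x_i, x_j) = |x_i - x_j|_2 \; q\!\left( p\!\left( \frac{x_i + x_j}{2} \right) \right).
\]
Using the fact that the gradient of q is bounded, we can write
\[
R_{\max} \le (1 + \lambda_1)\, q\!\left( p\!\left( \frac{x_i + x_j}{2} \right) \right)
\quad \forall \lambda_1 > 2\, \frac{\max_x |\nabla_x q(p(x))|_2 \, \epsilon}{\min_x q(p(x))}.
\]
Hence,
where
\[
\lambda_2 = \frac{2 \delta^2 \max_x |\nabla q(p(x))|_2}{\epsilon}.
\]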
Proof Since q ≤ 1, dM (xi , xj ) ≤ |xi − xj |2 . Among the exterior edges, we only
need to consider those between nodes which are within δ of the boundary of Cp (α)
or outside Cp (α). This is because of the way we approximate paths which leave
Theorem 17.10 (lower bound on the computing error) ∀ζ < 1/(2d), a computing
error (uniform over all pairs of points x, y) of
with $\lambda = c\, n^{-\zeta}$ can be achieved with probability ≥ δ′ for a sufficiently large data
sample n ≥ N(δ′) (c is a constant).
Proof We show that the shortest path along the graph is within λ of the DBD
metric, by considering two cases based on the properties of the shortest path. We
define a new graph g2 on the data points which contains only a subset of the edges
in g. g2 contains all edges in g where |xi − xj |2 ≤ ǫ. In addition, it contains edges in
g that leave Cp (α; ǫ) and whose endpoints, xi and xj , lie within δ of the boundary of
Cp (α; ǫ). Note that g2 is sufficient to approximate all shortest paths between data
points. However, it is difficult to compute/generate and hence we define a more
dense graph g with the property that the extra edges are most likely not going to
be used in the shortest path unless they form a good approximation to the shortest
path along g2 .
Case (a) : The shortest path along g lies entirely within the subset g 2 .
Using the theorem from Giné and Guillou (2002), we can conclude that the choice
in section 17.2 of kernel width, $h_n = 1/n^{1/(2s+d)}$, and other properties assumed about
p(x) ensure that almost surely,
\[
\max_x |\hat p_n(x) - p(x)| = O\left( \sqrt{ \frac{(2s + d) \log(n)}{n^{2s/(2s+d)}} } \right).
\]
This means that for sufficiently large n, all points y in $C_p(\alpha; 2\epsilon)$ have the property
that $p(y) \ge \alpha - \alpha_1$ for arbitrarily small $\alpha_1$. Using this fact and the δ-sampling
condition (Tenenbaum et al., 2000; Bernstein et al., 2000), we know that the
requirement for lemma 17.7 is satisfied when $n = \Omega\left( \left( \frac{1}{\delta} \right)^{d} \log \frac{1}{\delta} \right)$. This condition
is satisfied with a choice of ζ < 1/(2d) and letting $\delta = c_1 n^{-2\zeta}$ and $\epsilon = c_2 n^{-\zeta}$
($c_1$ and $c_2$ are constants). Let $\lambda_3 = \max(\lambda_1, \lambda_2)$, where $\lambda_1$ and $\lambda_2$ are defined in
lemmas 17.8 and 17.9 respectively. Hence we can use lemmas 17.7, 17.8, and 17.9
to conclude that
\[
(1 - 2\lambda_3)\, d_M(x, y) \le d_g(x, y) \le (1 + 2\lambda_3) \left( 1 + \frac{6\delta}{\epsilon} + \frac{8\delta^2}{\epsilon^2} \right) d_M(x, y),
\]
which implies that
where
\[
\lambda_4 = O\left( \epsilon + \frac{\delta}{\epsilon} \right) = O(n^{-\zeta}).
\]
Case (b) : The shortest path, P , along g uses some edges that are not part of g 2 .
Consider any edge E connecting xl and xm in the shortest path along g that is not
in g2 . We will show that there is a path through g2 that can closely approximate this
edge E and hence this shortest path. We consider the case when only one section
near the endpoint xm is more than δ in Cp (α; ǫ). The case when more sections of
E are in Cp (α; ǫ) can be similarly handled. Consider the boundary point rb where
the straight line starting at xm toward xl first touches the edge of Cp (α; ǫ). By the
δ-sampling condition, there is a data point xk within δ of rb . Consider the path
consisting of the edge xl –xk and the shortest path, P2 , between xk and xm through
those edges of g that connect nodes within ǫ of one another. Let d′g2 be the length
of a path that follows P except when it comes to edges not in g2 in which case
it follows paths P2 constructed to pass through g2 . Let dg2 be the length of the
shortest path along graph g2 . From proof of case (a), we know that
Hence,
Theorem 17.11 (upper bound on the computing error) The computing error,
when using an ε-neighborhood-based graph on the data sample, cannot converge
to zero faster than $1/n^{1/(d-1)}$ with probability ≥ δ′ for sufficiently large data sample
n ≥ N(δ′).
Proof This result is shown using an example for which the approximation error
when using the graph converges at rate $1/n^{1/(d-1)}$. Consider the case when data density
is uniform over any convex set. (Note that all continuous density functions can be
approximated by a constant function in a small enough neighborhood.) In this case
the graph construction method described at the beginning of this section reduces
to an ε-neighborhood graph (with high probability). Consider any two points x′,
x′′ in the interior of the support of the density. The shortest path between x′
and x′′ is the straight line joining them. Consider a d-dimensional cuboid which
circumscribes a cylinder of radius δ/2 around this line. If none of the points in the
data sample lie in this cuboid, the approximation error in measuring the length of
this line along the graph edges will be at least of order δ. The probability of this
happening, $(1 - c\,\delta^{d-1})^n$, can be lower-bounded by a constant if δ is chosen to be
of order $1/n^{1/(d-1)}$.
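The last step of the proof can be checked numerically: with δ of order $n^{-1/(d-1)}$, the probability that the cuboid contains no sample point stays bounded away from zero as n grows. In this sketch the constant c is set to 1 for illustration and the function name is hypothetical.

```python
import math

def prob_empty(n, d, c=1.0):
    """Evaluate (1 - c * delta**(d-1))**n with delta = n**(-1/(d-1))."""
    delta = n ** (-1.0 / (d - 1))
    return (1.0 - c * delta ** (d - 1)) ** n

# delta**(d-1) equals 1/n, so the probability is (1 - c/n)**n,
# which approaches exp(-c) as n grows.
values = [prob_empty(n, d=3) for n in (100, 10_000, 1_000_000)]
```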
to the manifold specified by q(p(x)). In this manifold, the lengths scale locally as
q(p(x)); hence it can be verified that for any function h on $\mathbb{R}^d$
\[
|l_M h|_2 = \sup_x \frac{1}{q(p(x))}\, |\nabla_x h|_2.
\]
von Luxburg and Bousquet (2004) have shown that the 1-nearest neighbor (1NN)
classifier corresponds to a large-margin classifier. In the case of the DBD metric,
1NN is equivalent (using the modified Lipschitz constant according to the density-based
manifold) to the optimization problem
\[
\arg\min_h \sup_x \frac{1}{q(p(x))}\, |\nabla_x h|_2 \quad \text{under constraints } y_i h(x_i) \ge 1.
\]
As p(x) increases, $1/q(p(x))$ also increases and hence this optimization problem
corresponds to penalizing the gradient of the classifier function h in high-density
regions and allowing h to change in the low-density regions. This agrees with
the intuition that data points in the same high-density region are likely to have
similar labels. Please see (Bousquet et al., 2004) for a discussion on regularization
appropriate for semi-supervised learning and its relationship to modifying geometry
based on the data density.
In this section, we present experimental results on data from the UCI machine
learning repository, summarized in table 17.1. The three methods we compare
are standard 1NN, DBD metric-based 1NN, and the randomized min-cut method
(Blum et al., 2004). The randomized min-cut method involves averaging over results
obtained from several min-cuts and it is suggested by Blum et al. (2004) that those
min-cuts which lead to a very unbalanced classification are to be rejected. However,
there is no clear way to choose this cutoff ratio. For the results presented here we
choose the cutoff to be slightly smaller than the “true” ratio between the classes in
the data set. For the DBD-based 1NN implementation, we chose the function q to
fall exponentially with increase in density beyond α, which in turn was chosen to
be smaller than the estimated density at all sample points.
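As a minimal illustration of how a precomputed metric enters the 1NN rule, the sketch below classifies a query from one row of a distance matrix. The function name, the labels, and the distance values are hypothetical and are not taken from the experiments reported here.

```python
def one_nn_predict(dist_row, labeled):
    """Label of the labeled point nearest to a query under a precomputed metric.

    dist_row[j] is the distance from the query to point j;
    labeled maps the indices of labeled points to their class labels.
    """
    nearest = min(labeled, key=lambda j: dist_row[j])
    return labeled[nearest]

# Hypothetical distances from one query to three points under two metrics.
euclidean_row = [1.0, 0.0, 3.0]
dbd_row = [0.96, 0.0, 0.32]
labeled = {0: "A", 2: "B"}

euclidean_label = one_nn_predict(euclidean_row, labeled)
dbd_label = one_nn_predict(dbd_row, labeled)
```

With these made-up numbers the two metrics disagree: the Euclidean row assigns label "A" and the density-based row assigns label "B", since only the distance matrix changes while the classifier stays the same.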
We performed experiments for labeled set size varying between 2 and 20 and
the accuracy results are shown in figure 17.3. We observed that DBD-based 1NN
performed better than or similar to the standard 1NN algorithm for all data sets
with small dimension. We conjecture that the reason DBD-based 1NN performed
worse than 1NN for the digits example is the difficulty of density estimation
in very high dimensions. The DBD-based 1NN algorithm performed better than
the other two when the number of labeled examples was very small, except in the
case of the digits example. One interesting result was that of the two abalone data
examples, in which the randomized min-cut algorithm performed much better than
both NN algorithms in one case and much worse in the other.

Figure 17.3 Classification results comparing 1NN (.), DBD-based 1NN (x) and randomized
min-cut (o) algorithms. Each panel plots % accuracy against the number of labeled
points (5 to 20); the recoverable panel titles are Abalone - 5 vs 9 and Digits - 1 vs 2.

17.5 Conclusions and Future Work
We have shown that density-based distance metrics which satisfy certain properties
can be estimated consistently using an estimator obtained by plugging in the kernel
density estimate of the data distribution. In terms of s, a smoothness parameter
that corresponds to how many times data density is known to be differentiable
and d, the data dimension, we have shown that the rate of convergence of such an
estimator is s/(2s + d). We have also shown that no estimator can converge at a rate
faster than 1/2.
This contains both good and bad news. The knowledge that we have consistent
estimation is useful when applying the method to voluminous data (for example,
webpages). However, we expect d to be high for many machine learning applications
and we might not be able to assume that the smoothness parameter, s, is very high.
Hence, when using the plug-in estimator, the convergence rate can be very slow for
high-dimensional data.
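To make the effect of dimension concrete, the following is a minimal sketch that evaluates the quoted rate for a few values of d. It assumes the rate is read as an exponent, i.e. that the estimation error shrinks like n^(-s/(2s+d)); that reading, and both function names, are assumptions made for illustration only.

```python
def plug_in_rate_exponent(s: float, d: int) -> float:
    """Exponent s / (2s + d) quoted above for the plug-in estimator."""
    return s / (2.0 * s + d)


def sample_growth_to_halve_error(s: float, d: int) -> float:
    """Factor by which n must grow to halve an error assumed to scale as n**(-s/(2s+d))."""
    return 2.0 ** (1.0 / plug_in_rate_exponent(s, d))


if __name__ == "__main__":
    # Smoothness s = 2 is an arbitrary illustrative choice.
    for d in (2, 10, 100):
        exponent = plug_in_rate_exponent(2, d)
        growth = sample_growth_to_halve_error(2, d)
        print(f"d={d}: exponent={exponent:.4f}, growth factor={growth:.3g}")
```

Under this reading, the exponent falls quickly as d grows, which is the slow convergence in high dimensions that the paragraph above warns about.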
We have shown a graph construction method that can be used for consistent
computation of DBD metrics and shown that, with high probability, the approximation
error when using this graph goes to zero faster than $1/n^{1/2d}$.
We have also shown that the shortest distance along a nearest neighborhood-based
graph on the data cannot converge to the true distance faster than $1/n^{1/(d-1)}$ with high
probability. We presented semi-supervised classification results that demonstrate
that using DBD metrics can sometimes improve performance over using simple
Euclidean distance, when data density can be estimated with reasonable reliability.
While we have been able to give a theoretical understanding of DBD metrics, further
experimental investigation of their use for semi-supervised learning is needed to
make them a practically viable choice. While several papers have considered DBD
metrics, the only papers that present experimental results with real world data
use the 1NN algorithm (Lebanon, 2003; Sajama and Orlitsky, 2005). Experiments
using these metrics with other classification algorithms, using parametric density
estimation in place of the kernel density estimator, and studying alternative graph
construction and weighting methods for more accurate and efficient computation
will be of practical value.
Acknowledgements
We thank Sanjoy Dasgupta and Thomas John for several helpful discussions.
Thanks also to anonymous reviewers for several comments used in revising and
improving this paper. In particular, we thank an anonymous reviewer for pointing
out an error in the analysis of the estimation error rate in an earlier version.
V Semi-Supervised Learning in Practice
18 Large-Scale Algorithms
18.1 Introduction
ŷ(xi ) = ŷi is the value of the function ŷ on training points (labeled and unlabeled).
In chapter 11, we defined a quadratic cost (Eq. 11.11):
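Minimizing this cost gives rise to the following linear system in Ŷ with regularization
hyperparameters μ and ε:
$$(S + \mu L + \mu\epsilon I)\,\hat{Y} = S Y. \qquad (18.2)$$
The value of ŷ at a new point x is then given by the induction formula
$$\hat{y}(x) = \frac{\sum_j W_X(x, x_j)\,\hat{y}_j}{\sum_j W_X(x, x_j) + \epsilon}. \qquad (18.3)$$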
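As an illustration of the induction formula (18.3), here is a minimal NumPy sketch. The Gaussian form of W_X, the bandwidth `sigma`, the default `eps`, and the function names are assumptions chosen for the example and are not prescribed by the text.

```python
import numpy as np


def gaussian_weights(x, X, sigma=1.0):
    """W_X(x, x_j) for every row x_j of X (Gaussian kernel: an assumed choice)."""
    sq_dist = np.sum((X - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def induce(x, X, y_hat, eps=1e-6, sigma=1.0):
    """Induction formula (18.3): weighted average of the known values y_hat."""
    w = gaussian_weights(x, X, sigma)
    return float(w @ y_hat / (w.sum() + eps))
```

Each call costs O(n) kernel evaluations, one per training point, and the `eps` term in the denominator pulls the prediction toward zero for points that are far from all training points.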
18.2 Cost Approximations
A simple way to reduce the O(kn²) computational requirement and O(kn) memory
requirement for training the non-parametric semi-supervised algorithms of
chapter 11 is to force the solutions to be expressed in terms of a subset of the examples.
This idea has already been exploited successfully in a different form for other
kernel algorithms, e.g., for Gaussian processes (Williams and Seeger, 2001) or spectral
embedding algorithms (Ouimet and Bengio, 2005).
Here we will take advantage of the induction formula (Eq. 18.3) to simplify the
linear system to m ≪ n equations and variables, where m is the size of a subset
of examples that will form a basis for expressing all the other function values. Let
S ⊂ {1, . . . , n} be a subset, with |S| = m and S ⊃ {1, . . . , l} (i.e., we take all
labeled examples in the subset). Define R = {1, . . . , n}\S (the rest of the data).
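In the following, vectors and matrices will be split into their S and R parts
(e.g., $\hat{Y}_S$ and $\hat{Y}_R$, or $W_{SS}$, $W_{SR}$, $W_{RS}$, and $W_{RR}$).
The idea is to force $\hat{y}_i \in \hat{Y}_R$ to be expressed as a linear combination of the $\hat{y}_j \in \hat{Y}_S$
following (18.3):
$$\forall i \in R, \quad \hat{y}_i = \frac{\sum_{j \in S} W_{ij}\,\hat{y}_j}{\sum_{j \in S} W_{ij} + \epsilon} \qquad (18.4)$$
or in matrix notation
$$\hat{Y}_R = \overline{W}_{RS}\,\hat{Y}_S, \qquad (18.5)$$
where $\overline{W}_{RS}$ has entries $W_{ij} / (\sum_{k \in S} W_{ik} + \epsilon)$ for $i \in R$, $j \in S$. The cost can then
be split into the parts $C_{SS}$, $C_{RR}$, $C_{RS}$, and $C_L$ used in section 18.2.2, ending with the
labeled-data term:
$$\cdots + \underbrace{\|\hat{Y}_l - Y_l\|^2}_{C_L}.$$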
18.2.2 Resolution
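Using the approximation $\hat{Y}_R = \overline{W}_{RS}\hat{Y}_S$ (18.5), the gradient of the different parts
of the above cost with respect to $\hat{Y}_S$ is then
$$\frac{\partial C_{SS}}{\partial \hat{Y}_S} = 2\mu\left(D^S_{SS} - W_{SS} + \epsilon I\right)\hat{Y}_S$$
$$\frac{\partial C_{RR}}{\partial \hat{Y}_S} = 2\mu\,\overline{W}_{RS}^{\top}\left(D^R_{RR} - W_{RR} + \epsilon I\right)\overline{W}_{RS}\,\hat{Y}_S$$
$$\frac{\partial C_{RS}}{\partial \hat{Y}_S} = 2\mu\left(D^R_{SS} + \overline{W}_{RS}^{\top} D^S_{RR}\,\overline{W}_{RS} - \overline{W}_{RS}^{\top} W_{RS} - W_{SR}\overline{W}_{RS}\right)\hat{Y}_S = 2\mu\left(D^R_{SS} - W_{SR}\overline{W}_{RS}\right)\hat{Y}_S \qquad (18.7)$$
$$\frac{\partial C_L}{\partial \hat{Y}_S} = 2 S_{SS}\left(\hat{Y}_S - Y_S\right),$$
where to obtain (18.7) we have used the equality $D^S_{RR}\overline{W}_{RS} = W_{RS}$, which follows
from the definition of $\overline{W}_{RS}$ (the overline marks the normalized weights of (18.4)).
Recall the original linear system in $\hat{Y}$ was $(S + \mu L + \mu\epsilon I)\hat{Y} = SY$ (18.2). Here
it is replaced by a new system in $\hat{Y}_S$, written $A\hat{Y}_S = S_{SS}Y_S$ with
$$A = \mu\left(D^S_{SS} - W_{SS} + \epsilon I + D^R_{SS} - W_{SR}\overline{W}_{RS}\right) + \mu\,\overline{W}_{RS}^{\top}\left(D^R_{RR} - W_{RR} + \epsilon I\right)\overline{W}_{RS} + S_{SS}.$$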
Since the system’s size has been reduced from n to |S| = m, it can be solved much
faster, even if A is not guaranteed to be sparse anymore (we assume m ≪ n).
Unfortunately, in order to obtain the matrix A, we need to compute $D^R_{RR}$, which
costs O(n²) in time, as well as products of matrices that cost O(mn²) if W is not
sparse. A simple way to get rid of the quadratic complexity in n is to ignore $C_{RR}$
in the total cost. If we remember that $C_{RR}$ can be written
$$C_{RR} = \mu\left(\frac{1}{2}\sum_{i,j \in R} W_{ij}\,(\hat{y}_i - \hat{y}_j)^2 + \epsilon\,\|\hat{Y}_R\|^2\right),$$
ignoring it amounts to dropping the smoothness constraints between pairs of points
in R, together with the norm penalty on $\hat{Y}_R$. Two considerations support this choice.
When reducing to a subset, the loss in capacity (we can choose m values instead
of n when working with the full set) suggests we should weaken regularization,
and the smoothness constraints are a form of regularization; thus dropping some of
them is a way to achieve this goal.
For some points i ∈ R, the approximation (18.4)
$$\hat{y}_i = \frac{\sum_{j \in S} W_{ij}\,\hat{y}_j}{\sum_{j \in S} W_{ij} + \epsilon}$$
may be poor (e.g. for a point far from all points in S, i.e. $\sum_{j \in S} W_{ij}$ very small);
thus smoothness constraints between points in R could be noisy and detrimental
to the optimization process (this is not a big issue when considering smoothness
between a point xi in R and a point xj in S as the smoothness penalty is weighted
by Wij , which will be small if xi is far from all points in S).
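Given the above considerations, ignoring the part $C_{RR}$ leads to the new system
$$\left(\mu\left(D^S_{SS} - W_{SS} + \epsilon I + D^R_{SS} - W_{SR}\overline{W}_{RS}\right) + S_{SS}\right)\hat{Y}_S = S_{SS}Y_S,$$
obtained by removing the $C_{RR}$ term from A (with $\overline{W}_{RS}$ the normalized weights of (18.4)),
which in general can be solved in O(m³) time (less if the system matrix is sparse).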
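The reduced system with the C_RR part dropped can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: S_SS is taken to be the 0/1 diagonal matrix flagging labeled points, the degree matrices are taken to be diagonal matrices of row sums, the subset is assumed to list the labeled points first, and the function name and default hyperparameters are hypothetical.

```python
import numpy as np


def solve_subset_system(W, y, S_idx, n_labeled, mu=1.0, eps=1e-6):
    """Solve the reduced system (C_RR dropped) and extend the solution to R.

    W: (n, n) weight matrix; y: (n,) targets (only labeled entries matter);
    S_idx: indices of the subset S, assumed to start with the labeled points.
    Returns (Y_S, Y_R).
    """
    n = W.shape[0]
    S_idx = np.asarray(S_idx)
    R_idx = np.setdiff1d(np.arange(n), S_idx)
    m = len(S_idx)

    W_SS = W[np.ix_(S_idx, S_idx)]
    W_SR = W[np.ix_(S_idx, R_idx)]
    W_RS = W[np.ix_(R_idx, S_idx)]

    # Normalized weights of (18.4): each row divided by (sum_j W_ij + eps).
    W_bar_RS = W_RS / (W_RS.sum(axis=1, keepdims=True) + eps)

    D_S_SS = np.diag(W_SS.sum(axis=1))  # weight of each S point towards S
    D_R_SS = np.diag(W_SR.sum(axis=1))  # weight of each S point towards R

    # Assumption: S_SS is diagonal with 1 for labeled points and 0 otherwise.
    S_SS = np.diag((np.arange(m) < n_labeled).astype(float))

    A = mu * (D_S_SS - W_SS + eps * np.eye(m) + D_R_SS - W_SR @ W_bar_RS) + S_SS
    Y_S = np.linalg.solve(A, S_SS @ y[S_idx])
    Y_R = W_bar_RS @ Y_S  # approximation (18.4) for the remaining points
    return Y_S, Y_R
```

Building A here needs only the m × m and m × (n − m) blocks of W, so the dominant cost is the product `W_SR @ W_bar_RS`, which is O(m²n), followed by an O(m³) dense solve.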
18.3 Subset Selection
In general, training using only a subset of m ≪ n samples will not perform as well
as using the whole data set. Carefully choosing the subset S can help in limiting
this loss in performance. Even if random selection is certainly the easiest way to
choose the points in S, it has two main drawbacks:
It may not pick points in some regions of the space, resulting in the approximation
(18.4) being very poor in these regions.
It may pick uninteresting points: the region near the decision surface is the one
where we are more likely to make mistakes by assigning the wrong label. Therefore,
we would like to have as many points as possible in S being in that region, while
we do not need points which are far away from that surface.
As a result, it is worthwhile considering more elaborate subset selection schemes,
such as the one presented in the next section.
There could be many ways of choosing which points to take in the subset. The
algorithm described below is one solution, based on the previous considerations
about the random selection weaknesses. The first step of the algorithm will be
to select points somewhat uniformly in order to get a first estimate of the decision
surface, while the second step will consist in the choice of points near that estimated
surface.
Equation 18.4,
$$\hat{y}_i = \frac{\sum_{j \in S} W_{ij}\,\hat{y}_j}{\sum_{j \in S} W_{ij} + \epsilon},$$
suggests that the value of ŷi is well approximated when there is a point in S near
xi (two points xi and xj are nearby if Wij is high). The idea will therefore be to
cover the manifold where the data lie as well as possible, that is to say ensure that
every point in R is near a point (or a set of points) in S. There is another issue we
should be taking care of: as we discard the part CRR of the cost, we must now be
careful not to modify the structure of the manifold. If there are some parts of the
manifold without any point of S, then the smoothness of ŷ will not be enforced at
such parts (and the labels will be poorly estimated).
This suggests starting with S = {1, . . . , l} and R = {l + 1, . . . , n}, then adding
samples xi by iteratively choosing the point farthest from the current subset, i.e.
the one that minimizes $\sum_{j \in S} W_{ij}$. The idea behind this method is that it is useless
to have two points near each other in S, as this will not give extra information
while increasing the cost. However, one can note that this method may tend to
select outliers, which are far from all other points (and especially those from S).
A way to avoid this is to consider the quantity $\sum_{j \in R \setminus \{i\}} W_{ij}$ for a given xi . If xi
is such an outlier, this quantity will be very low (as all Wij are small). Thus, if it
is smaller than a given threshold δ, we do not take xi in the subset. The cost of
this additional check is of O((m + o)n) where o is the number of outliers: assuming
there are only a few of them (less than m), it scales as O(mn).
Once this first subset is selected, it can be refined by training the algorithm
presented in section 11.3.2 on the subset S, in order to get an approximation of the
ŷi for i ∈ S, and by using the induction formula 18.4 to get an approximation of the
ŷj for j ∈ R. Samples in S which are far away from the estimated decision surface
can then be discarded, as they will be correctly classified no matter whether they
belong to S or not, and they are unlikely to give any information on the shape of
the decision surface. These discarded samples are then replaced by other samples
that are near the decision surface, in order to be able to estimate it more accurately.
The distance from a point xi to the decision surface is estimated by the confidence
we have in its estimated label ŷi . In the binary classification case considered here
(with targets −1 and 1), this confidence is given by |ŷi |, while in a multiclass setting
it would be the absolute value of the difference between the predicted scores of the
two highest-scoring classes. One should be careful when removing samples, though:
we must make sure we do not leave “empty” regions. This can be done by ensuring
that $\sum_{j \in S} W_{ij}$ stays above some threshold for all i ∈ R after a point has been
removed.
Overall, the cost of this selection phase is on the order of O(mn + m3 ). It is
summarized in algorithm 18.1.
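The first step of the selection phase (greedy coverage with an outlier check) can be sketched as below. This follows only the prose description above and is not a transcription of algorithm 18.1; the function name, the dense matrix W, and the default threshold `delta` are assumptions made for illustration, and the refinement step near the decision surface is not included.

```python
import numpy as np


def greedy_subset(W, n_labeled, m, delta=0.0):
    """Grow S from the labeled points by repeatedly adding the candidate with the
    smallest total weight to S, skipping candidates that look like outliers."""
    n = W.shape[0]
    in_S = np.zeros(n, dtype=bool)
    in_S[:n_labeled] = True
    rejected = np.zeros(n, dtype=bool)           # candidates flagged as outliers
    weight_to_S = W[:, :n_labeled].sum(axis=1)   # sum_{j in S} W_ij, kept up to date

    while in_S.sum() < m:
        candidates = np.flatnonzero(~in_S & ~rejected)
        if candidates.size == 0:
            break
        i = candidates[np.argmin(weight_to_S[candidates])]
        # Outlier check: total weight from x_i to the other points of R.
        weight_to_R = W[i, ~in_S].sum() - W[i, i]
        if weight_to_R < delta:
            rejected[i] = True
            continue
        in_S[i] = True
        weight_to_S += W[:, i]                   # O(n) incremental update
    return np.flatnonzero(in_S)
```

Each accepted or rejected candidate costs O(n), which matches the O((m + o)n) figure quoted in the text for m selected points and o outliers.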
We are now in position to present the overall computational requirements for the
different algorithms proposed in this chapter. As before, the subset size m is taken
to be much smaller than the total number of points n, and the weight matrix
W may either be dense or sparse (with k non-zero entries in each row or column).
Table 18.1 summarizes time and memory requirements for the following algorithms:
                    Time      Memory
NoSub (sparse W)    O(kn²)    O(kn)
NoSub (dense W)     O(n³)     O(n²)
RandSub             O(m²n)    O(m²)
SmartSub            O(m²n)    O(m²)
NoSub: the original transductive algorithm (using the whole data set) that consists
in solving the system (18.2), as presented in chapter 11 (algorithm 11.2),
RandSub: the approximation algorithm discussed in section 18.2.2, with the subset
S being randomly chosen (section 18.3.1),
SmartSub: the same approximation algorithm as RandSub, but with S being
chosen as in section 18.3.2.
The table shows that the approximation method described in this chapter is
particularly useful when W is dense or n is very large. This is confirmed by
empirical experimentation in figure 18.1, which compares the training times (on the
benchmark data set SecStr described in chapter 21 of this book) of NoSub with a
dense kernel, NoSub with a sparse kernel, and SmartSub with a dense kernel. With
a dense kernel, NoSub quickly becomes impractical because of the need to store
(and solve) a linear system of size n = l + u, with l = 100 and u ∈ [2000, 50,000].
With a sparse kernel (and the iterative version presented in algorithm 11.2) it scales
much better, but still exhibits a quadratic dependency in n. On the other hand,
SmartSub can handle much more unlabeled data as its training time scales only
linearly in n. We have not presented a sparse version of SmartSub since our current
code cannot take advantage of a sparse weighting function. However, this could be
useful to obtain further improvement, especially in terms of memory usage (working
with full m × m matrices can become problematic when m ≥ 10,000).
18.4 Discussion
[Figure 18.1: plot of training time in seconds (0 to 1500) against the number of unlabeled samples (0 to 50,000); recovered legend entries: NoSub (dense W), NoSub (sparse W).]
Figure 18.1 Training time (in seconds) w.r.t. the number of unlabeled samples on
benchmark data set SecStr (cf. chapter 21). WX is a Gaussian kernel (combined with
an approximate 100-nearest-neighbor kernel in the sparse case). There are l = 100 labeled
samples, and SmartSub selects m = 500 unlabeled samples in the subset approximation
scheme. Note how the dependence of SmartSub on the total number of unlabeled samples
u ∈ [2000, 50,000] is only linear. NoSub with dense W fails for u ≥ 10,000 because of
memory shortage. Experiments were performed on a 3.2 GHz P4 CPU with 2 GB of RAM.
19 Semi-Supervised Protein Classification Using Cluster Kernels
19.1 Introduction
The 3D structure that a protein assumes after folding largely determines its function
in the cell. However, it is far easier to determine experimentally the primary
sequence (the string of amino acids) that makes up a protein than it is to discover
its 3D structure. Through evolution, structure is more conserved than sequence,
so that detecting even very subtle sequence similarities, or remote homology, is
important for predicting structure, which can help infer function. Computational
techniques have proven very successful at aiding biologists in this task.
The major methods for homology detection can be split into three basic groups:
pairwise sequence comparison algorithms (Altschul et al., 1990; Smith and
Waterman, 1981), generative models for protein families (Krogh et al., 1994; Park
et al., 1998), and discriminative classifiers (Jaakkola et al., 2000; Leslie et al., 2003;
Liao and Noble, 2002). Popular sequence comparison methods such as BLAST and
Smith-Waterman are based on alignment scores. Generative models such as profile
hidden Markov models (HMMs) model positive examples of a protein family, but
these models can be trained iteratively using both positively labeled and unlabeled
examples by pulling in close homologs and adding them to the positive set. A
compromise between these methods is PSI-BLAST (Altschul et al., 1997), which uses
BLAST to iteratively build a probabilistic profile of a query sequence and obtain a
more sensitive sequence comparison score.
Finally, classifiers such as support vector machines (SVMs) use both positive
and negative examples and provide state-of-the-art performance when used with
appropriate distance metrics (i.e., appropriate kernels) (Jaakkola et al., 2000; Leslie
et al., 2003; Liao and Noble, 2002; Saigo et al., 2004). To be more specific, to solve
this task as a classification problem, the input is the string of amino acids: the
string is typically hundreds of “characters” in length, and the characters themselves
have an alphabet of size 20. Posed as a binary classification problem, a classifier
can answer the question: “Does the given protein (amino acid sequence) belong
to structural class X or not?” and should be trained with positive and negative
examples of this class.
Building an accurate system, as in most machine learning tasks, depends critically
upon choosing a good representation of the input sequences of amino acids. The first
hurdle is that the inputs are not vectors of fixed dimension, and so to use standard
methods like SVMs one must define a similarity on sequences. This is possible by
using string kernels, whereby one embeds the strings into a vector space and then
performs inner products in this space. This issue is discussed in section 19.2. A
study of the performance of these methods compared to more classical techniques
is also detailed there.
In practice, however, relatively little labeled data are available (approximately
30,000 proteins with known 3D structure, some belonging to families and
superfamilies with only a handful of labeled members), whereas there are close to one
million sequenced proteins, providing abundant unlabeled data. The basic method in the
literature (Jaakkola et al., 2000; Leslie et al., 2003) to take advantage of this extra
data is to use an auxiliary method (such as PSI-BLAST) in order to add predicted
homologs of the positive training examples to the training set before training the
classifier. New semi-supervised learning techniques should be able to make better
use of these unlabeled data.
Some of the recent work in semi-supervised learning has focused on changing the
representation given to a classifier by taking into account the structure described by
the unlabeled data (Chapelle et al., 2003; Szummer and Jaakkola, 2002b). These
efforts can be viewed as cases of cluster kernels, which learn similarity metrics
based on the cluster assumption: when two points are in the same “cluster” (or are
connected by a path of high density) in the original metric they should have a small
distance to each other in the new metric. This review describes an experimental
comparison of cluster kernels and some other competing methods on the protein
classification problem. In particular, two simple and scalable cluster kernel methods
will be described that were developed explicitly for this problem. The neighborhood
kernel (Weston et al., 2003a) uses averaging over a neighborhood of sequences
defined by a local sequence similarity measure, and the bagged kernel (Weston et al.,
2003a) uses bagged clustering of the full sequence data set to modify the base kernel.
Finally, we compare these two methods to a problem-specific solution, the profile
kernel of Kuang et al. (2004). In this kernel, each sequence is represented by a profile
estimated from a large unlabeled database (using PSI-BLAST, for example); the
profile kernel uses a substring-based feature map, but is defined on sequence profiles
rather than the sequences themselves.
In both the semi-supervised and transductive settings, these last three techniques
all provide greatly improved classification performance when used with mismatch
string kernels, and the techniques achieve equal or superior results to all
previously presented cluster kernel methods that we tried. Moreover, they are far more
computationally efficient than these competing methods. The profile kernel
provides perhaps the best scalability, whereas the neighborhood and bagged kernels
provide similar performance and good scaling ability, while providing more general
applicability.
The chapter is organized as follows. We begin with an overview of sequence
representations for supervised classifiers in section 19.2, followed in section 19.3 by
a review of existing cluster kernel methods for incorporating unlabeled data into
the kernel representation. In sections 19.3.2, 19.3.3, and 19.3.4, we describe the
profile, neighborhood, and bagged mismatch kernels. Finally, detailed experiments
comparing these techniques are given in section 19.4.
19.2 Representations and Kernels for Protein Sequences
where d(x, y) is the pairwise score (or E-value) between x and y, and xi for
i = 1, . . . , l are the training sequences. Using SW E-values in this fashion
(the SVM-pairwise method of Liao and Noble, 2002) gives strong classification
performance. Note, however, that SVM-pairwise is slow, both because computing
each SW score is O(|x|²) and because computing each empirically mapped kernel
value is O(l).
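The empirical kernel map idea can be sketched as follows, assuming it takes the usual form in which a sequence x is represented by the vector (d(x, x_1), ..., d(x, x_l)) of its scores against the l training sequences. That form is an assumption here, and `toy_score` is a hypothetical stand-in for a real alignment score such as a Smith-Waterman E-value.

```python
import numpy as np


def empirical_kernel_map(x, train_seqs, score):
    """Represent x by its vector of pairwise scores against the training sequences."""
    return np.array([score(x, xi) for xi in train_seqs], dtype=float)


def pairwise_kernel(x, y, train_seqs, score):
    """Inner product of two empirically mapped vectors (O(l) once scores are known)."""
    phi_x = empirical_kernel_map(x, train_seqs, score)
    phi_y = empirical_kernel_map(y, train_seqs, score)
    return float(phi_x @ phi_y)


def toy_score(a, b):
    """Hypothetical stand-in for an alignment score: number of shared 3-grams."""
    grams_a = {a[i:i + 3] for i in range(len(a) - 2)}
    grams_b = {b[i:i + 3] for i in range(len(b) - 2)}
    return len(grams_a & grams_b)
```

With a real alignment score in place of `toy_score`, each feature vector requires l pairwise alignments, which is where most of the cost noted in the text arises.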
Another appealing idea is to derive the feature representation from a generative
model for a protein family. In the Fisher kernel method (Jaakkola et al., 2000),
one first builds a profile HMM for the positive training sequences, defining a
log-likelihood function log P (x|θ) for any protein sequence x. Then the gradient
vector $\nabla_\theta \log P(x|\theta)\,|_{\theta=\theta_0}$, where θ0 is the maximum-likelihood estimate for model
parameters, defines an explicit vector of features, called Fisher scores, for x. This
representation gives excellent classification results, but the Fisher scores must be
computed by an O(|x|²) forward-backward algorithm, making the kernel tractable
but slow.
It is possible to construct useful kernels directly without explicitly depending on
generative models by using string kernels. For example, the mismatch kernel (Leslie
et al., 2003) is defined by a histogram-like feature map that uses mismatches to
capture inexact string matching. The feature space is indexed by all possible
p-length subsequences α = a1 , a2 , . . . , ap , where each ai is a character in the alphabet
A of amino acids. The feature map is defined on p-gram α by $\Phi(\alpha) = (\phi_\beta(\alpha))_{\beta \in A^p}$,
where $\phi_\beta(\alpha) = 1$ if α is within m mismatches of β, and 0 otherwise. The feature
map is extended additively to longer sequences: $\Phi(x) = \sum_{p\text{-grams } \alpha \in x} \Phi(\alpha)$. The
mismatch kernel can be computed efficiently using a trie data structure: the
complexity of calculating k(x, y) is $O(c_k(|x| + |y|))$, where $c_k = p^{m+1}|A|^m$. For
typical kernel parameters p = 5 and m = 1 (Leslie et al., 2003), the mismatch
kernel is fast, scalable, and yields impressive performance.
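The feature map just described can be written down directly. The sketch below is a naive enumeration meant only to make the definition concrete; it does not use the trie-based computation mentioned above and is far slower on real data. The function names and the one-letter amino acid alphabet string are choices made for this illustration.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mismatch_neighborhood(gram, m, alphabet=ALPHABET):
    """All p-grams within m mismatches (Hamming distance <= m) of `gram`."""
    result = {gram}
    for r in range(1, m + 1):
        for positions in combinations(range(len(gram)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(gram)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def mismatch_features(seq, p=5, m=1, alphabet=ALPHABET):
    """Sparse feature map: entry beta counts the p-grams of seq within m mismatches of beta."""
    features = Counter()
    for j in range(len(seq) - p + 1):
        for beta in mismatch_neighborhood(seq[j:j + p], m, alphabet):
            features[beta] += 1
    return features


def mismatch_kernel(x, y, p=5, m=1, alphabet=ALPHABET):
    """Inner product of the two sparse feature vectors."""
    fx = mismatch_features(x, p, m, alphabet)
    fy = mismatch_features(y, p, m, alphabet)
    if len(fx) > len(fy):
        fx, fy = fy, fx
    return sum(count * fy[beta] for beta, count in fx.items())
```

For p = 5 and m = 1 each p-gram contributes to 1 + 5 · 19 = 96 features, so the sparse representation stays small even though the full feature space has 20⁵ coordinates.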
Other direct string kernel methods include pair HMM and convolution kernels
(Watkins, 1999; Haussler, 1999; Lodhi et al., 2002), which are quite general but also
have complexity O(|x||y|); more recent and related string alignment kernels (Saigo
et al., 2004), also with complexity O(|x||y|); and exact-matching string kernels built
with suffix trees and suffix links, with complexity O(|x| + |y|) (Vishwanathan and
Smola, 2002). Inexact string matching models similar to the mismatch kernel but
with complexity $O(c_k(|x| + |y|))$, with $c_k$ independent of alphabet size, have also
been presented (Leslie and Kuang, 2003). The motif kernel (Ben-Hur and Brutlag,
2003) uses features that are built from a fixed database of motifs; computing these
features is linear in the length of the sequence. Finally, almost all these kernels
can be constructed using the rational kernel framework of Cortes et al. (2002). We
concentrate on the mismatch kernel representation for the current work.
[Figure 19.1: curves of number of families (0 to 60) against ROC score (0 to 1) for SVM-SW, SVM-MISMATCH, SVM-BLAST, SVM-fisher, SAM, PSI-BLAST, and knn-SW.]
Figure 19.1 Comparison of protein representations and classifiers without use of
unlabeled data. The graph plots the total number of families for which a given method exceeds a
ROC score threshold. The SVM-based methods use the following kernels: Smith-Waterman
empirical kernel map (SW), mismatch kernel with p = 5 and m = 1 (mismatch), BLAST
empirical kernel map (BLAST), and a Fisher kernel built from a profile HMM (Fisher).
The three non-SVM methods include a hidden Markov model (SAM), the PSI-BLAST
algorithm, and the kernel k-nearest neighbor algorithm (k=3) using a Smith-Waterman
empirical kernel map.
1. The E-value is the expected number of times that an alignment score as good as or better
than the observed score would appear by chance in a random sequence database
of the given size.
In figure 19.1, we summarize the results from Liao and Noble (2002) and Leslie
et al. (2003) by comparing SVM performance with these representations and other
homology detection methods on a problem that does not include the use of unlabeled
data. These and subsequent experiments are based upon the structural classification
of proteins (SCOP) (Murzin et al., 1995). This is a widely used database of
protein structures, in which proteins are organized hierarchically into classes, folds,
superfamilies, and families. SCOP contains only proteins for which the 3D structure
is available; hence, related proteins can be placed into a single superfamily even
when their amino acid sequences have diverged evolutionarily. The SCOP database
has been used in a large number of published studies of protein homology detection
and fold classification algorithms. Here, we use a benchmark data set of Liao and
Noble (2002), which consists of 54 two-class remote homology detection problems.
The positive test set is a target protein family to be detected, and the positive
training set contains sequences that are only remotely related to the target. The
negative training and test sets are proteins from two disjoint sets of folds, and
contain no proteins from the target fold. All methods are evaluated by using receiver
operating characteristic (ROC) scores. More details concerning the experimental
setup can be found at http://www1.cs.columbia.edu/compbio/svm-pairwise.
The results indicate that, without unlabeled data, SVM methods using a number
of representations perform very strongly. They are superior to both HMMs (SAM-
T98 (Park et al., 1998)) and pairwise scoring functions like PSI-BLAST. We believe
that the SVM Fisher kernel computed here performs poorly because the underlying
HMMs lack sufficient training data. (For a method comparison in the more standard
setting, where domain homologs from an unlabeled database are added to the
training set, see Leslie et al. (2003); there, SVM-Fisher is competitive with the
mismatch kernel.) Note, however, that k-nearest neighbors (k-NN) performs poorly
even with a good representation (the SW representation), so the choice
of classifier is also important.
It seems clear that string kernel methods with SVMs are a powerful approach,
but in a real-world setting, classifiers have access to unlabeled data. We now discuss
how to incorporate such data into the representation given to SVMs via the use of
cluster kernels.
19.3 Semi-Supervised Kernels for Protein Sequences
We will focus on classifiers that re-represent the given data to reflect structure
revealed by unlabeled data. The main idea is to change the distance metric so that
the relative distance between two points is smaller if the points are in the same
cluster. If one is using kernels, rather than explicit feature vectors, one can modify
the kernel representation by constructing a cluster kernel.
Previous work of Chapelle et al. (2003) presented a general framework for
producing cluster kernels by modifying the eigenspectrum of the kernel matrix.
Two of the main methods presented are the random walk kernel and the spectral
clustering kernel, which we will briefly summarize below. See chapter 15 for more
details on these and other spectral cluster kernels.
In this section and the next, we introduce two fast and general cluster kernels that
leverage unlabeled data to improve a base kernel representation. Unlike other cluster
kernel approaches, these kernels make use of two complementary (dis)similarity
measures: a base kernel representation which implicitly makes use of features useful
for discrimination between classes, and a distance measure that describes how close
examples are to each other. In our application to protein classification, we use the
mismatch string kernel as the base kernel and standard sequence comparison metrics
(such as BLAST or PSI-BLAST E-values) as the distance measure. We note that
string kernels have proved to be powerful representations for SVM classification
(Leslie et al., 2003) but do not give sensitive pairwise similarity scores like the
BLAST family methods; thus the two sequence similarity measures play distinct
roles in the kernel definition.
For the neighborhood kernel, we use a standard sequence dissimilarity measure
like BLAST or PSI-BLAST to define a neighborhood for each input sequence. The
neighborhood Nbd(x) of sequence x is the set of sequences x′ with similarity score
to x below a fixed E-value threshold, together with x itself. Now given a fixed
original feature representation, we represent x by the average of the feature vectors
for members of its neighborhood:
$$\Phi_{nbd}(x) = \frac{1}{|\mathrm{Nbd}(x)|} \sum_{x' \in \mathrm{Nbd}(x)} \Phi_{orig}(x').$$
The neighborhood kernel is then defined by
$$k_{nbd}(x, y) = \frac{\sum_{x' \in \mathrm{Nbd}(x),\, y' \in \mathrm{Nbd}(y)} k_{orig}(x', y')}{|\mathrm{Nbd}(x)|\,|\mathrm{Nbd}(y)|}.$$
We will see in the experimental results that this simple neighborhood-averaging
Figure 19.2 Neighborhood averaging for a toy data set. Feature representations
for a toy data set before (left) and after (right) the neighborhood averaging operation.
The shaded region is the union of the convex hulls of the neighborhood point sets for the
original data.
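The neighborhood kernel defined above reduces to two matrix products once a base kernel matrix and a matrix of pairwise E-values are available. The sketch below assumes both are precomputed, dense, and indexed over the same labeled plus unlabeled sequences; the function name and the default threshold of 0.05 (the value used later in the experiments) are choices made for this illustration.

```python
import numpy as np


def neighborhood_kernel(K_orig, E, threshold=0.05):
    """K_nbd[i, j] = mean of K_orig over Nbd(i) x Nbd(j).

    K_orig: (n, n) base kernel matrix; E: (n, n) pairwise E-values
    (smaller means more similar).
    """
    n = K_orig.shape[0]
    member = E < threshold                        # member[i, j]: x_j is in Nbd(x_i)
    member[np.arange(n), np.arange(n)] = True     # x_i always belongs to Nbd(x_i)
    M = member / member.sum(axis=1, keepdims=True)  # row i averages over Nbd(x_i)
    return M @ K_orig @ M.T
```

Writing the averaging as `M @ K_orig @ M.T` makes it clear that the result is still a valid kernel, since it is the Gram matrix of the averaged feature vectors.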
A number of existing clustering techniques are much more efficient than the methods
mentioned in section 19.3. For example, the classical k-means algorithm is O(rknd),
where n is the number of data points, d is their dimensionality, and r is the number
Because k-means gives different solutions on each run, step 1 will give different
results; for other clustering algorithms one could subsample the data instead. Step
2 is a valid kernel because it is the inner product in an Nk-dimensional space
$\Phi(x) = \langle [c_j(x) = q] : j = 1, \dots, N,\ q = 1, \dots, k \rangle$, and products of kernels as in
step 3 are also valid kernels. The intuition behind the approach is that the original
kernel is rescaled by the “probability” that two points are in the same cluster,
hence encoding the cluster assumption. To estimate the kernel on a test sequence
x in a semi-supervised setting, one can assign x to the nearest cluster in each of
the bagged runs to compute $k_{bag}(x, x_i)$. We apply the bagged kernel method with
$k_{orig}$ as the mismatch kernel and $k_{bag}$ built by running k-means on the distances
induced by PSI-BLAST.
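The three steps referred to above can be sketched as follows, reading them as: (1) run k-means several times, (2) set the bagged kernel to the fraction of runs in which two points share a cluster, and (3) multiply it elementwise with the original kernel. The sketch clusters explicit feature vectors with a plain Lloyd's iteration, whereas the text clusters on distances induced by PSI-BLAST; that substitution, the function names, and the default parameters are assumptions made for illustration.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means from a random initialization; returns cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for q in range(k):
            if np.any(labels == q):
                centers[q] = X[labels == q].mean(axis=0)
    return labels


def bagged_kernel(X, K_orig, k, n_runs=10, seed=0):
    """Rescale K_orig by the fraction of k-means runs placing two points together."""
    rng = np.random.default_rng(seed)
    K_bag = np.zeros_like(K_orig, dtype=float)
    for _ in range(n_runs):
        labels = kmeans_labels(X, k, rng)                # step 1: one clustering run
        K_bag += labels[:, None] == labels[None, :]      # step 2: same-cluster indicator
    K_bag /= n_runs
    return K_orig * K_bag                                # step 3: product of kernels
```

Pairs of points that never share a cluster get a kernel value of zero, while pairs that always share one keep their original kernel value.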
Note that the emission probabilities, $P_{j+i}(b)$, i = 1 . . . p, come from the profile
P(x); for notational simplicity, we do not explicitly indicate the dependence on
x. Typically, the profiles are estimated from close homologs found in a large
sequence database; however, these estimates may be too restrictive for our purposes.
Therefore, we smooth the estimates using background frequencies, q(b), b ∈ A, of
amino acids in the training data set via
$$\tilde{P}_i(b) = \frac{P_i(b) + t\,q(b)}{1 + t}, \quad i = 1 \dots |x|,$$
where t is a smoothing parameter. We use the smoothed emission probabilities $\tilde{P}_i(b)$
in place of $P_i(b)$ in defining the mutation neighborhoods.
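The smoothing step is a single vectorized operation. In the sketch below the profile is assumed to be stored as an array with one row per sequence position and one column per amino acid; the function name and the default value of t are illustrative assumptions, not values taken from the text.

```python
import numpy as np


def smooth_profile(P, q, t=0.1):
    """Smoothed emissions (P_i(b) + t q(b)) / (1 + t) for every position i.

    P: (L, |A|) emission probabilities, one row per position.
    q: (|A|,) background amino acid frequencies.
    """
    return (P + t * q) / (1.0 + t)
```

If every row of P and the vector q each sum to one, every smoothed row also sums to one, so the result is still a set of emission distributions.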
We now define the profile feature mapping as
$$\Phi^{\mathrm{Profile}}_{(p,\sigma)}(P(x)) = \sum_{j=0}^{|x|-p} \left(\phi_\beta(P(x[j+1 : j+p]))\right)_{\beta \in A^p}, \qquad (19.2)$$
19.4 Experiments
Our first experiment shows that the neighborhood mismatch kernel makes better
use of unlabeled data than the baseline method of “pulling in homologs” prior to
354 Semi-Supervised Protein Classification Using Cluster Kernels
Using BLAST for homologs & neighborhoods Using PSI−BLAST for homologs & neighborhoods
60 60
50 50
Number of families
Number of families
40 40
30 30
20 20
mismatch(5,1) mismatch(5,1)
10 mismatch(5,1)+homologs 10 mismatch(5,1)+homologs
neighborhood mismatch(5,1) neighborhood mismatch(5,1)
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
ROC−50 ROC−50
Mismatch(5,1)+homologs ROC−50
Mismatch(5,1)+homologs ROC−50
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Neighborhood Mismatch(5,1) ROC−50 Neighborhood Mismatch(5,1) ROC−50
training the SVM classifier, that is, simply finding close homologs of the positive
training examples in the unlabeled set and adding them to the positive training
set for the SVM. Homologs come from the unlabeled set (not the test set), and
“neighbors” for the neighborhood kernel come from the training plus unlabeled
data. We compare the methods using the mismatch kernel representation with
p = 5 and m = 1, as used in Leslie et al. (2003). Homologs are chosen via BLAST
or PSI-BLAST as having a pairwise E-value less than 0.05 (the default parameter
setting (Altschul et al., 1990)) with any of the positive training samples. The
neighborhood mismatch kernel uses the same threshold to choose neighborhoods.
For the neighborhood kernel, we normalize before and after the averaging operation
via $K_{ij} \leftarrow K_{ij} / \sqrt{K_{ii} K_{jj}}$. The results are given in figure 19.3 and table 19.1.
Figure 19.3 plots the number of families achieving a given ROC50 score. Thus,
a strongly performing method produces a curve close to the top right of the plot.
A signed rank test shows that the neighborhood mismatch kernel yields significant
improvement over adding homologs (p-value 3.9e-05). Note that the PSI-BLAST
scores in these experiments are built using the whole database of 7329 sequences
(that is, test sequences in a given experiment are also available to the PSI-BLAST
algorithm), so these results are slightly optimistic. However, the comparison of
methods in a truly inductive setting using BLAST shows the same improvement of
the neighborhood mismatch kernel over adding homologs (p-value 8.4e-05).
Table 19.1 Mean ROC50 and ROC scores over 54 target families for semi-supervised
experiments, using BLAST and PSI-BLAST for adding homologs and defining the
neighborhood kernel
                                  BLAST            PSI-BLAST
                                  ROC50   ROC      ROC50   ROC
mismatch kernel                   0.416   0.870    0.416   0.870
mismatch kernel + homologs        0.480   0.900    0.550   0.910
neighborhood mismatch kernel      0.639   0.922    0.699   0.923
The improvement from the neighborhood kernel does not come from the BLAST
and PSI-BLAST representations alone: the mean ROC50 scores for these
representations using an empirical map (see the transductive setting for a description)
are 0.368 and 0.533, respectively, without pulling in homologs, and 0.448 and 0.595
with pulled-in homologs. Moreover, simply adding the BLAST and mismatch kernels
together (using an empirical map) without using homologs yields a mean ROC50 of
0.3943, so the improvement also cannot be attributed to the two methods providing
independent information about the targets that is easily combined.
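For readers who want to reproduce scores of this kind, the following is a minimal sketch of a ROC-n computation. It assumes the usual definition of ROC50 as the area under the ROC curve up to the first 50 false positives, rescaled so that a perfect ranking scores 1; the function name `roc_n` is hypothetical, and tied scores are simply ordered by position.

```python
import numpy as np

def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives, scaled to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # rank examples by decreasing score
    ranked = labels[order]
    num_pos = int(ranked.sum())
    num_neg = ranked.size - num_pos
    n = min(n, num_neg)                          # fewer than n negatives: use them all
    if num_pos == 0 or n == 0:
        raise ValueError("need at least one positive and one negative example")
    # For each of the first n false positives, count the true positives ranked above it.
    tp_above = np.cumsum(ranked)[~ranked][:n]
    return float(tp_above.sum()) / (n * num_pos)

# Toy ranking: positives at ranks 1 and 3, negatives at ranks 2 and 4.
score = roc_n([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

When fewer than n negatives are available, as in the toy ranking, the value coincides with the ordinary ROC score.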
Figure 19.4 Comparison of protein representations and classifiers using unlabeled data
in a transductive setting. Neighborhood and bagged mismatch kernels outperform pulling
in close homologs (left) and equal or outperform previous semi-supervised methods (right).
Both panels plot the number of families against ROC-50. Curves in the left panel:
mismatch(5,1); mismatch(5,1)+homologs; neighborhood mismatch(5,1); bagged mismatch(5,1),
k=100. Curves in the right panel: PSI-BLAST; + close homologs; spectral cluster, k=100;
random walk, t=2.
Note: We also pull in homologs during the SVM training for the neighborhood and bagged
kernels.
Table 19.2 Mean ROC50 and ROC scores over 54 target families for transductive
experiments
                                     ROC50   ROC
mismatch kernel                      0.416   0.875
mismatch kernel + homologs           0.625   0.924
neighborhood mismatch kernel         0.704   0.917
bagged mismatch kernel (k = 100)     0.719   0.943
bagged mismatch kernel (k = 400)     0.671   0.935
PSI-BLAST kernel                     0.533   0.866
PSI-BLAST+homologs kernel            0.585   0.873
spectral clustering kernel           0.581   0.861
random walk kernel                   0.691   0.915
transductive SVM                     0.637   0.874
the training set and the chosen close homologs. Finally, we also run transductive
SVMs. The results are given in table 19.2 and figure 19.4 (right). A signed rank
test (with adjusted p-value cutoff of 0.05) finds no significant difference between
the neighborhood kernel, the bagged kernel (k = 100), and the random walk kernel
in this transductive setting. Thus the new techniques are comparable with random
walk, but are feasible to calculate on full-scale problems.
Semi-supervised and transductive methods are most interesting and potentially give
greatest benefit in the realistic setting where a large amount of unlabeled data is
used. We therefore test the cluster kernel methods in large-scale experiments, using
101,602 Swiss-Prot protein sequences as additional unlabeled data. For simplicity,
we first give results for both the neighborhood and bagged kernels in the trans-
ductive setting, that is, in the case where test sequences are available as additional
unlabeled examples in all the experiments. Then, for a clean comparison against
the profile kernel, we test the neighborhood kernel and the profile kernel in a semi-
supervised setting, where the Swiss-Prot database alone is used as the source of
unlabeled data.
For the large-scale neighborhood mismatch kernel experiments, we first compute
the entire SCOP plus Swiss-Prot kernel matrix (108,931 × 108,931) with mismatch
kernel parameters p = 5 and m = 1. We then apply the neighborhood averaging
operation to produce the 7329 × 7329 kernel matrix for SCOP sequences needed for
SVM training. We normalize the kernel matrix before and after the neighborhood
averaging operation. Results in table 19.3 clearly show that the inclusion of a
large amount of additional unlabeled data from Swiss-Prot significantly improves
classification performance. Moreover, the neighborhood kernel again outperforms
the baseline method of adding homologs of the positive training sequences to the
training set.
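The pipeline just described (normalize, average over neighborhoods, normalize again) can be sketched on a toy scale as follows. The sketch assumes that neighborhood averaging replaces each sequence's feature vector by the mean over its neighborhood, so that the kernel value for two sequences is the average of the base kernel over all pairs drawn from their neighborhoods; the function names, the dictionary layout, and the random toy data are hypothetical.

```python
import numpy as np

def normalize_kernel(K):
    """Rescale K so that every diagonal entry equals 1: K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def neighborhood_kernel(K, neighborhoods, targets):
    """Average a base kernel over sequence neighborhoods.

    K             : (N, N) base kernel over labeled plus unlabeled sequences.
    neighborhoods : dict mapping a target index to the list of indices in its
                    neighborhood (assumed here to include the target itself).
    targets       : indices of the sequences that need kernel values.
    """
    K = normalize_kernel(K)                      # normalize before averaging
    A = np.zeros((len(targets), K.shape[0]))     # row r averages over Nbd(targets[r])
    for r, i in enumerate(targets):
        nbd = neighborhoods[i]
        A[r, nbd] = 1.0 / len(nbd)
    return normalize_kernel(A @ K @ A.T)         # normalize after averaging

# Toy data: 6 "sequences" with random 4-dimensional feature vectors (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
K_base = X @ X.T
neighborhoods = {0: [0, 3, 4], 1: [1, 5], 2: [2]}
K_nbd = neighborhood_kernel(K_base, neighborhoods, targets=[0, 1, 2])
```

At the scale reported above, the averaging matrix would be stored in sparse form and the base kernel would not be held in memory as a dense array; the dense version here only makes the arithmetic explicit.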
For the large-scale bagged mismatch kernel experiments, the fact that many of
the sequences in the Swiss-Prot database are multidomain protein sequences com-
plicates the clustering step: since the PSI-BLAST E-values used as the dissimilarity
metric are based on local alignment, a multidomain sequence can be similar to many
unrelated single-domain sequences, and hence the clustering algorithm may fail to
converge. As an approximate remedy, we only use Swiss-Prot protein sequences
with a maximal length of 250 for the large-scale k-means clustering, reasoning that
most multidomain sequences would be eliminated by this length constraint. We
randomly sample 30,000 protein sequences from the Swiss-Prot sequences of length 250
or less to use as unlabeled data for clustering. Since the method mainly depends on
the quality of the clusters containing the labeled points, we terminate the k-means
clustering algorithm once there are no more changes in the label assignment for the
SCOP sequences. It is worth noting that a small number of two-domain sequences
may have length below our cutoff, but we observe that the k-means clustering
algorithm still behaves relatively stably. We use the same mismatch kernel parameters
for the bagged kernel as the ones we use for the small-scale bagged kernel experi-
ments. A comparison of results is shown in table 19.3. Again, bagged kernel perfor-
mance significantly improves when a large amount of unlabeled data is provided to
the clustering algorithm. Finally, we also compare with the semi-supervised profile
kernel approach. The profile kernel representation depends on estimating sequence
profiles for each input sequence using a large sequence database, and therefore we
only present results in the large-scale setting. The profile kernel performs very well.
Note that adding homologs (the baseline approach to semi-supervised learning) can
be used in conjunction with any of the cluster kernel methods. We found that this
combination of approaches improved the results in all cases.
Finally, in order to make a clean comparison of the stronger of the cluster kernels,
the neighborhood kernel, with the profile kernel, we ran a separate experiment,
where we use a semi-supervised training setup: the Swiss-Prot database alone
is used as the source of unlabeled data for estimating PSI-BLAST profiles and
defining sequence neighborhoods; SCOP sequences are not used for profile learning
or for neighborhood averaging. For the cleanest comparison, we do not add SCOP
homologs to the positive training set before training the SVMs. Mean ROC50 and
ROC results are given in table 19.4, and a comparison of ROC50 results over all
experiments is given in figure 19.5. Results from the cluster kernel and profile kernel
methods are similar (20 wins, 25 losses, 9 ties for the cluster kernel); a signed rank
test with a p-value threshold of 0.05 finds no significant difference in performance
between the two methods.
Table 19.3 Mean ROC50 and ROC scores over 54 target families for large-scale
transductive experiments. Note: We include homologs from the unlabeled set and the test set
(SCOP + Swiss-Prot) for the training of all the SVMs, apart from the profile kernel, which
does not use any homologs.
Table 19.4 Mean ROC50 and ROC scores over 54 target families for large-scale
semi-supervised experiments. Note: We do not include homologs from the unlabeled set
(Swiss-Prot) for the training of the SVMs in these experiments.
                                  ROC50   ROC
neighborhood mismatch kernel      0.810   0.955
profile kernel                    0.842   0.980
19.5 Discussion
Figure 19.5 Comparison of neighborhood kernel and profile kernel ROC50 performance
for large-scale semi-supervised experiments (horizontal axis: (5,1)-neighborhood kernel;
vertical axis: (5,7.5)-profile kernel). No homologs were added to the training set
for the purpose of training the SVMs.
data and do not require diagonalization of the kernel matrix as in other cluster ker-
nel methods.
Moreover, these techniques can be applied to any problem with a meaningful local
similarity measure or distance function. A potential direction for improvement in
the neighborhood kernel would be to extract only those segments of “neighboring”
sequences that correspond to the local alignment-based E-value score; when we use
entire multidomain Swiss-Prot sequences as neighbors of a single-domain SCOP
sequence, these neighbor sequences may include long regions that are unrelated to
the SCOP domain, and hence we introduce noise in the neighborhood averaging
operation.
While we have motivated these kernels by earlier work on cluster kernels and the
cluster assumption, one can also view the neighborhood and bagged kernels as using
unlabeled data locally (from nearby sequences or the local cluster) for smoothing the
kernel representation. Related work using probabilistic models instead of unlabeled
data for smoothing includes the recently introduced Bhattacharyya kernel (Jebara
et al., 2004), which assigns a probability distribution to each example and defines
a kernel on these distributions.
We also compared to the profile-based string kernels of Kuang et al. (2004),
which are also based on a semi-supervised learning paradigm. These string kernels
are also scalable and achieve very high classification accuracy; in our experiments,
the neighborhood kernel performs similarly to the profile kernel. However, the
profile kernel method requires producing a profile for each query sequence, which is
necessarily tied to alignment. In contrast, the cluster kernels that we present here
are more general, in that any dissimilarity measure can be used for neighborhood
averaging or bagging and any base kernel chosen for the initial representation.
Moreover, for the bagged kernel any clustering algorithm, not just k-means, can be
employed. These kernels may therefore be applicable to a wider range of problems.
For example, one could use expression coherence in a set of microarray experiments
as a measure of functional similarity of genes combined with a base kernel to define
cluster kernels for functional gene classification. One could also hope to further
improve performance for the protein classification task by using a more powerful
base kernel than the mismatch kernel (for example, the string alignment kernel of
Saigo et al. (2004)), though the computational expense of the improved base kernel
representation may become a concern.
All the experiments described in this chapter compare methods using a binary
classification approach, a setup that seems to be established for kernel-based
approaches. Common methods like BLAST and PSI-BLAST, however, address
the full multiclass problem directly; the binary framework thus seems to favor
SVM methods and ignores the additional benefit of methods that address the
multiclass task. Some of our recent work (Ie et al., 2005) addresses this
issue by applying the profile kernel-based SVM to the multiclass fold recognition
task. The results, which are beyond the scope of this chapter to describe in detail,
indicate that semi-supervised SVMs are significantly better than PSI-BLAST when
applied to the multiclass problem as well.
Future work should extend these results by combining cluster kernels with learning
methods that address additional challenges of protein classification: further
analysis of the full multiclass problem, which potentially involves thousands of
classes; dealing with very small classes with few homologs; incorporating hierarchical
labels and knowledge of relationships between classes; and dealing with missing
classes, for which no labeled examples exist.
Supplementary data and source code are available at
www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot. The Spider Matlab package is
available at www.kyb.tuebingen.mpg.de/bs/people/spider.
20 Prediction of Protein Function from Networks
20.1 Introduction
the complex mechanisms of the cell (Alberts et al., 1998). The function prediction
problem can be depicted on an undirected graph (see figure 20.1). Focusing on a
particular functional class, the task boils down to a two-class classification problem.
A protein whose class label is known is annotated +1 if it belongs to the class, −1
otherwise. A protein whose class label is to be predicted is unannotated (i.e.,
“?” in the figure). Once a graph is defined, the problem can be dealt with within
the framework of graph-based semi-supervised learning thanks to recent progress
(Zhou et al., 2004; Belkin and Niyogi, 2003b; Zhu et al., 2003b; Chapelle et al.,
2003); see also part III of this book. The class label of an unannotated protein is
inferred from those of adjacent nodes, each weighted by the weight of the
connecting edge. See section 20.2 for details.
Typically, multiple graphs are available to represent the same set of proteins
in terms of various sources of information. For instance, an edge set can represent
physical interactions of the proteins (Schwikowski et al., 2000; Uetz et al., 2000; von
Mering et al., 2002), gene regulatory relationships (Lee et al., 2002; Ihmels et al.,
2002; Segal et al., 2003a), closeness in a metabolic pathway (Kanehisa et al., 2004),
similarities between protein sequences (Yona et al., 1999), etc. (see figure 20.2).
Each source contains partially independent and partially complementary informa-
tion about the task at hand. However, no single information source is sufficient to
identify protein functions reliably. One way to enhance reliability is to integrate mul-
tiple sources. In computational biology, a number of methods have been proposed
to classify proteins based on networks such as majority vote (Schwikowski et al.,
2000; Hishigaki et al., 2001), graph-based methods (Vazquez et al., 2003), Bayesian
methods (Deng et al., 2003), discriminative learning methods (Vert and Kane-
hisa, 2003; Lanckriet et al., 2004c), and probabilistic integration by log-likelihood
scores (Lee et al., 2004). See also (Tsuda and Noble, 2004) and references therein.
Figure 20.2 Multiple graphs: A set of graphs is given, each of which depicts a different
aspect of the proteins. Since different graphs contain partly independent and partly
complementary pieces of information, one can enhance the total information by combining
these graphs.
1. Recently, a fast and greedy approximation method was proposed (Bach et al., 2004),
but the worst-case complexity does not change.
yi ∈ {−1, 1}, and the remaining u = n − l test nodes are unlabeled. The goal is to
predict the labels yl+1 , . . . , yn by exploiting the structure of the graph under the
assumption that the label of an unlabeled node is likely to be similar to the labels
of its neighboring nodes. A closer or more strongly connected neighbor
node affects the node more strongly.
The first term corresponds to the loss function in terms of condition (a), and the
second term represents the smoothness of the scores in terms of condition (b).
The parameter c trades off loss versus smoothness. Another small regularization
term, $\mu \sum_{i=l+1}^{n} \hat{y}_i^2$, can be added in order to keep the scores of unlabeled nodes
in a reasonable range. However, for simplicity, we degenerate this term into the
smoothness term (b) by assuming μ = 1. Alternative choices of smoothness and loss
functions can be found in Chapelle et al. (2003). It is more prevalent to represent
(20.1) with matrices
where $Y = (y_1, \dots, y_l, 0, \dots, 0)^\top$, and the matrix $L$ is called the graph Laplacian
matrix (Chung, 1997), which is defined as $L = D - W$, where $D = \mathrm{diag}(d_i)$ and
$d_i = \sum_j w_{ij}$. Instead of $L$, the normalized Laplacian, $\mathcal{L} = D^{-1/2} L D^{-1/2}$, can be
used to get a similar result (Chung, 1997). The solution of this problem is obtained
as
$$\hat{Y} = (I + cL)^{-1} Y, \qquad (20.3)$$
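As an illustration of (20.3), the following minimal sketch computes the score vector for a small graph by solving the sparse linear system (I + cL)Ŷ = Y instead of forming the inverse. The function name `propagate_scores`, the four-node chain graph, and the choice c = 1 are hypothetical and serve only as an example.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def propagate_scores(W, y, c):
    """Solve (I + c L) y_hat = y with the unnormalized Laplacian L = D - W.

    W : symmetric (n, n) weight matrix (dense or sparse).
    y : length-n vector with +1/-1 for labeled nodes and 0 for unlabeled nodes.
    c : trade-off between loss and smoothness.
    """
    W = sp.csr_matrix(W, dtype=float)
    degrees = np.asarray(W.sum(axis=1)).ravel()
    L = sp.diags(degrees) - W
    A = sp.identity(W.shape[0], format="csc") + c * L.tocsc()
    return spsolve(A.tocsc(), np.asarray(y, dtype=float))

# Chain graph 0 - 1 - 2 - 3 with the two end nodes labeled +1 and -1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
y = np.array([1.0, 0.0, 0.0, -1.0])
y_hat = propagate_scores(W, y, c=1.0)
```

On this chain the scores decay smoothly from the positive end to the negative end, so each unlabeled node takes the sign of its nearer labeled neighbor.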
This amounts to taking the upper bound of the smoothness function Ŷ ⊤ Lk Ŷ over
all graphs and applying it for regularization. To investigate the properties of the
solution of the primal problem (20.5), let us derive the dual problem in a similar
way to that of Schölkopf and Smola (2002). Then, the convex optimization problem
can be rewritten as the following min-max problem using Lagrange multipliers,
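$$\max_{\alpha,\eta}\; \min_{\hat{Y},\gamma}\; (\hat{Y} - Y)^\top (\hat{Y} - Y) + c\gamma + \sum_{k=1}^{m} \alpha_k \bigl(\hat{Y}^\top L_k \hat{Y} - \gamma\bigr) - \eta\gamma, \qquad (20.6)$$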
in terms of the Lagrange multipliers, the optimal solution of the primal problem
gains more interpretability. For example, for support vector machines, the analysis
using the dual problem is effectively used for explaining the basic properties of the
discriminant hyperplane (e.g., large margin and support vectors) (Schölkopf and
Smola, 2002).
Now, let us solve the inner optimization problem. Setting the derivative of (20.6)
with respect to γ to zero, we obtain
$$c - \sum_{k=1}^{m} \alpha_k = \eta. \qquad (20.7)$$
Since η ≥ 0, the sum of the $\alpha_k$ is constrained as $\sum_{k=1}^{m} \alpha_k \le c$. Substituting (20.7) into
(20.6), we have
$$\max_{\alpha}\; \min_{\hat{Y}}\; (\hat{Y} - Y)^\top (\hat{Y} - Y) + \sum_{k=1}^{m} \alpha_k \hat{Y}^\top L_k \hat{Y}. \qquad (20.8)$$
By substituting (20.10), the Lagrangian (20.6) becomes the following dual prob-
lem,
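$$\max_{\alpha}\; Y^\top Y - Y^\top \Bigl(I + \sum_{k=1}^{m} \alpha_k L_k\Bigr)^{-1} Y \quad \text{subject to} \quad \sum_{k=1}^{m} \alpha_k \le c. \qquad (20.11)$$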
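The step from (20.8) to (20.11) can be spelled out as follows. This is a sketch under the assumption that (20.10) is the stationarity condition of the inner minimization over Ŷ; the shorthand M(α) is introduced here only for brevity, and the symmetry of each Laplacian is used.

```latex
\begin{align*}
M(\alpha) &:= I + \sum_{k=1}^{m} \alpha_k L_k, \\
\nabla_{\hat{Y}} \Bigl[ (\hat{Y} - Y)^\top (\hat{Y} - Y)
   + \sum_{k=1}^{m} \alpha_k \hat{Y}^\top L_k \hat{Y} \Bigr]
  &= 2 M(\alpha) \hat{Y} - 2 Y = 0
  \;\Longrightarrow\; \hat{Y} = M(\alpha)^{-1} Y, \\
(\hat{Y} - Y)^\top (\hat{Y} - Y) + \hat{Y}^\top \bigl( M(\alpha) - I \bigr) \hat{Y}
  &= \hat{Y}^\top M(\alpha) \hat{Y} - 2 \hat{Y}^\top Y + Y^\top Y \\
  &= Y^\top Y - Y^\top M(\alpha)^{-1} Y .
\end{align*}
```

Since the first term does not depend on α, maximizing this expression over α amounts to minimizing the quadratic form in the inverse of M(α).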
minimization problem:
$$\min_{\alpha}\; Y^\top \Bigl(I + \sum_{k=1}^{m} \alpha_k L_k\Bigr)^{-1} Y \quad \text{subject to} \quad \sum_{k=1}^{m} \alpha_k \le c. \qquad (20.12)$$
Denote by d(α) the dual objective function (20.12). Due to the Karush-Kuhn-Tucker
(KKT) conditions, we have $\alpha_k (\hat{Y}^\top L_k \hat{Y} - \gamma) = 0$ at the optimal solution.
Therefore, $\alpha_k = 0$ whenever $\hat{Y}^\top L_k \hat{Y} < \gamma$, and $\alpha_k > 0$ only if $\hat{Y}^\top L_k \hat{Y} = \gamma$. If the constraint
$\hat{Y}^\top L_k \hat{Y} \le \gamma$ is satisfied as an equality only for some of the graphs, we obtain a
sparse solution for $\alpha_k$, since the $\alpha_k$ corresponding to the other graphs are zero.
This implies integration with selectivity. A graph with zero weight (i.e., $\alpha_k = 0$) is
considered unnecessary or redundant since the optimal score vector $\hat{Y}$ would not
change even if it were removed. On the other hand, a graph with non-zero weight (i.e.,
$\alpha_k > 0$) satisfies $\hat{Y}^\top L_k \hat{Y} = \gamma$, and accordingly plays an essential role in determining
the value of the score vector.
20.3.2 Optimization
We can simply solve the optimization problem, for instance, with the gradient
descent method. This requires the computation of the dual objective d(α) as well
by means of the relation $\frac{\partial}{\partial a} B^{-1} = -B^{-1} \bigl(\frac{\partial}{\partial a} B\bigr) B^{-1}$. Although we have the inverse
matrix $(I + \sum_{k=1}^{m} \alpha_k L_k)^{-1}$ in the solution (20.10), the objective (20.12), and the
derivative (20.15) as well, we do not need to calculate it explicitly, because it always
appears in the vector form $(I + \sum_{k=1}^{m} \alpha_k L_k)^{-1} Y$, which can be obtained as the
solution of a sparse linear system. Therefore, the computational cost of the dual
objective and the derivative is nearly linear in the number of non-zero entries of
$\sum_{k=1}^{m} \alpha_k L_k$ (Spielman and Teng, 2004).
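A minimal sketch of how the dual objective and its gradient might be evaluated with a single sparse solve is given below. The gradient expression used in the code is obtained by applying the matrix-inverse derivative relation to the objective (20.12); it is derived here rather than copied from (20.15). The helper names and the two toy graphs are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def laplacian(W):
    """Unnormalized graph Laplacian D - W as a sparse matrix."""
    W = sp.csr_matrix(W, dtype=float)
    return sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W

def dual_objective_and_gradient(alphas, laplacians, y):
    """Return d(alpha) = y^T M^{-1} y and its gradient, with M = I + sum_k alpha_k L_k.

    With v = M^{-1} y, the relation d(B^{-1}) = -B^{-1} (dB) B^{-1} gives
    d/d(alpha_k) of d(alpha) = -v^T L_k v, so one sparse solve suffices.
    """
    M = sp.identity(y.shape[0], format="csc")
    for a, L in zip(alphas, laplacians):
        M = M + float(a) * L
    v = spsolve(sp.csc_matrix(M), y)             # v = M^{-1} y, no explicit inverse
    d = float(y @ v)
    grad = np.array([-(v @ (L @ v)) for L in laplacians])
    return d, grad

# Two toy graphs on four nodes: a chain and a star (hypothetical example data).
chain = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
star = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
laplacians = [laplacian(chain), laplacian(star)]
y = np.array([1.0, 0.0, 0.0, -1.0])
alphas = np.array([0.3, 0.7])
d, grad = dual_objective_and_gradient(alphas, laplacians, y)
```

The gradient is never positive because each Laplacian is positive semidefinite, so the objective does not increase as any weight grows; the weights stay bounded only through the constraint on their sum, which a constrained optimizer has to enforce.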
20.4 Experiments on Function Prediction of Proteins
The graph-combining method was evaluated on the data set provided by Lanckriet
et al. (2004c). The task is to classify the function of yeast proteins into
the 13 highest-level categories of the functional hierarchy (see table 20.1). The
function of 3588 proteins is labeled according to the MIPS comprehensive yeast
genome database (CYGD, http://mips.gsf.de/projects/fungi/yeast.html).
Note that a protein can belong to several functional classes. We solved a two-class
classification problem to determine membership or nonmembership of each func-
tional class, and evaluated the accuracy of each classification.
Table 20.2 lists the five different types of protein graphs (or networks) used in the
experiments. The graphs W1 and W5 are created from vectorial data, i.e., Pfam do-
main structure and gene expression, respectively. The graphs W2 , W3 , and W4 are
directly taken from the database in graph form, corresponding to coparticipation
in a protein complex, physical interactions, and genetic interactions, respectively.
See (Lanckriet et al., 2004c) for more information. The density of the Laplacian
matrices (i.e., the fraction of non-zero entries) is shown in the last column of the
table. All the matrices are very sparse (maximum density 0.8%), which saves
memory. Using a diffusion kernel instead would take much more memory
(by a factor of roughly 1/0.007 ≈ 142). In learning, each graph was transformed into
a normalized Laplacian matrix $L_k$.
classes
1 metabolism
2 energy
3 cell cycle and DNA processing
4 transcription
5 protein synthesis
6 protein fate
7 cellular transportation and transportation mechanism
8 cell rescue, defense, and virulence
9 interaction with cell environment
10 cell fate
11 control of cell organization
12 transport facilitation
13 others
(5, 5, 25, 25, 10, 10, 5, 5, 10, 10, 100, 2.5, 25),
respectively.
The graph-combining method was compared with individual graphs, and with
the state-of-the-art SDP/SVM method based on the reported results (Lanckriet
et al., 2004c). We then compared integration by optimized weights with integration
by fixed weights.
Table 20.2 Protein networks used in the experiment. Density shows the fraction of
non-zero entries in the respective Laplacian matrices.
When compared with individual graphs ($L_k$'s), the combined graph ($L_{\mathrm{opt}}$) outperformed
$L_k$ in terms of ROC score. To test the significance of the difference, McNemar's
test was conducted. In principle, McNemar's test is used to determine whether
one learning algorithm outperforms another on a particular learning task (Dietterich,
1998). Figure 20.3 shows the empirical p-value distribution of McNemar's
test. A small p-value indicates that $L_{\mathrm{opt}}$ is better than $L_k$. The total number of
trials amounts to 975 (= 3 repetitions × 5 pairwise tests × 5 CVs × 13 classes). In 594
(61%) trials, there is a statistically significant difference (significance level α = 0.05),
which corresponds to the leftmost bar in figure 20.3. Specifically, in each pairwise
comparison, $L_{\mathrm{opt}}$ significantly outperforms single $L_k$'s in 55.31%, 58.31%, 60.03%,
68.21%, and 61.03% of the total number of trials, respectively. Figure 20.4 presents
the comparison of ROC scores between $L_{\mathrm{opt}}$ and the best performing individual
graph.
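For reference, a single McNemar trial of the kind counted above can be sketched as follows. The sketch assumes the chi-square form of the test with continuity correction, applied to the per-example correctness of two classifiers on the same test set; the function name and the toy correctness vectors are hypothetical, and the exact variant used in the experiments is not specified here.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """McNemar's test (chi-square form with continuity correction).

    correct_a, correct_b : boolean arrays; entry i tells whether classifier A
    (respectively B) classified test example i correctly.
    Returns the test statistic and the p-value (1 degree of freedom).
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    n01 = int(np.sum(~correct_a & correct_b))    # A wrong, B right
    n10 = int(np.sum(correct_a & ~correct_b))    # A right, B wrong
    if n01 + n10 == 0:
        return 0.0, 1.0                          # the classifiers never disagree
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return stat, float(chi2.sf(stat, df=1))

# Toy outcome: A alone is right on 10 examples, B alone on 2, both on 5.
a = np.array([True] * 10 + [False] * 2 + [True] * 5)
b = np.array([False] * 10 + [True] * 2 + [True] * 5)
stat, p_value = mcnemar_test(a, b)
```

Only the examples on which the two classifiers disagree enter the statistic. The test itself is symmetric, so the direction of the difference has to be read from which of the two disagreement counts is larger.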
Figure 20.5 presents the comparison results between the graph-combining method
and the SDP/SVM method. The ROC score of the SDP/SVM method was obtained
from Lanckriet et al. (2004c). The ROC score of the Markov random field (MRF)
method from Deng et al. (2003) is also plotted in the figure. The MRF method
is an early work which shares the same data sources as ours for yeast protein
function prediction. For most classes, the graph-combining method achieves high
scores, which are similar to those of the SDP/SVM method. In classes 11 and 13, the
graph-combining method does not perform as well as SDP/SVM (but still better than the
MRF method), which is an indication of the superior generalization performance of
the SVM. We could not perform tests of significance since the detailed experimental
results of MRF or SDP/SVM were not available.
Figure 20.3 p-Value distribution of McNemar's test: For most of 975 McNemar's test
trials, $L_{\mathrm{opt}}$ outperforms $L_k$'s. Particularly, for 61% of the total number of trials, there is
a statistically significant difference (at a significance level of α = 0.05), which corresponds
to the leftmost bar in the figure.
Figure 20.4 Comparing ROC scores of combined networks and the best performing
individual graph. Within each group of bars, a blue bar corresponds to the best individual
graph, while a black bar corresponds to $L_{\mathrm{opt}}$. Across the 13 classes, $L_{\mathrm{opt}}$ outperforms the
best performing individual graph.
Now, let us compare the computational time. Solving the sparse linear system,
which appears in the solution (20.10), the objective (20.12), and the derivative
(20.15), only took 1.41 seconds (standard deviation 0.013) with the Matlab command
mldivide on a standard 2.2 GHz PC with 1 GB of memory. Solving the dual
problem (20.14), which requires solving the sparse linear system multiple times,
took 49.3 seconds (standard deviation 14.8) with the Matlab command
fmincon. In contrast, the SDP/SVM method takes several hours using a commercial
SDP solver (G.R.G. Lanckriet, personal communication). Thus, in the light
of its simplicity and efficiency (and hence scalability), the shorter computational
time of the graph-combining method compensates considerably for the slight loss
of accuracy against the SDP/SVM method.
Figure 20.5 ROC score comparison between MRF, SDP/SVM, and $L_{\mathrm{opt}}$ for 13
functional protein classes: Green bars correspond to the MRF method of Deng et al. (2003);
blue bars correspond to the SDP/SVM method of Lanckriet et al. (2004c). Black bars
correspond to $L_{\mathrm{opt}}$.
A combined graph with fixed weights was defined as $L_{\mathrm{fix}} = \frac{1}{m} \sum_{k=1}^{m} L_k$. Note that
the fixed weights correspond to the solution of (20.14) when $c_0 = c/m = 0.2c$.
The ROC scores for all functional classes are shown in figure 20.6, together with
the weights for the graphs. The optimization of weights did not always lead to
better ROC scores (except for classes 10, 11, 13). This can be explained using
SVM theory. The graph combined with fixed weights can be regarded as an SVM
decision function with all training data points, and the graph combined with
optimized weights as an SVM decision function with only support vectors. There is
no difference in accuracy between the two decision functions. Therefore we prefer
integration with optimized weights since it has the advantage of being able to single
out important graphs for learning over redundant ones without loss of accuracy.
Looking at the weights of Lopt in the figure, W4 and W5 almost always have very
low weights, which suggests that these two graphs can be redundant for learning.
The capability of selecting more important graphs would be especially valuable as
the number of available data sources increases. There was no statistically significant
difference between $L_{\mathrm{opt}}$ and $L_{\mathrm{fix}}$ in performance (McNemar's test, significance level
α = 0.05). Figure 20.7 presents typical ROC curves of $L_{\mathrm{opt}}$ and $L_{\mathrm{fix}}$ for class 1.
Figure 20.6 Prediction accuracy for 13 functional protein classes. The yellow bars and
the blue bars in the upper panel show the ROC scores of $L_{\mathrm{fix}}$ and $L_{\mathrm{opt}}$, respectively. The
middle and lower panels depict the combination weights of $L_{\mathrm{fix}}$ and $L_{\mathrm{opt}}$, respectively.
Figure 20.7 ROC curve for protein functional class 1. The thin blue and thick black
curves correspond to $L_{\mathrm{fix}}$ and $L_{\mathrm{opt}}$, respectively.
The benchmark consists of eight data sets as shown in table 21.1. Three of them
were artificially created in order to create situations that correspond to certain
assumptions (cf. chapter 1); this was done to allow for relating the performance
of the algorithms to those assumptions. The five other benchmark data sets were
derived from real data. It can thus be hoped that the performance on these is
indicative of the performance in real applications.
The purpose of the benchmark was to evaluate the power of the presented
algorithms themselves as neutrally as possible. Ideally, therefore, the data
preprocessing should be similar for all algorithms; in particular, preprocessing
should not take advantage of domain knowledge for some algorithms but not for
others. To prevent the experimenters from using domain knowledge, we tried to obscure
structure in the data (e.g., by shuffling the pixels in the images), and even to hide the
identity of the data sets (e.g., by also shuffling the data points). Also, we used the
same number of dimensions (241) and points (1500) for most data sets, both to
obscure the origin of the data and to increase the comparability
of the results. However, we did provide information as to which data sets originate
from images and which from text.
All data sets are available for further research at
http://www.kyb.tuebingen.mpg.de/ssl-book/.
g241c This data set was generated such that the cluster assumption holds, i.e., the
classes correspond to clusters, but the manifold assumption does not. First, 750
points were drawn from each of two unit-variance isotropic Gaussians (i.e., from
$\mathcal{N}(\mu_i, I)$), the centers of which had a distance of 2.5 in a random direction (i.e.,
$\|\mu_1 - \mu_2\| = 2.5$). The class label of a point represents the Gaussian it was drawn
from. Finally, all dimensions are standardized, i.e., shifted and rescaled to zero mean
and unit variance. A two-dimensional projection of the data is shown on the left
side of figure 21.1.
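The construction of g241c can be sketched as follows. This is an illustration of the recipe described above, not the script that produced the benchmark: the function name, the placement of the first center at the origin, and the assignment of the labels +1 and −1 to the two Gaussians are assumptions.

```python
import numpy as np

def make_g241c(rng, dim=241, per_class=750, distance=2.5):
    """Draw two unit-variance isotropic Gaussians whose centers are `distance` apart."""
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)       # random unit direction
    mu1 = np.zeros(dim)                          # assumed: first center at the origin
    mu2 = mu1 + distance * direction
    X = np.vstack([rng.standard_normal((per_class, dim)) + mu1,
                   rng.standard_normal((per_class, dim)) + mu2])
    y = np.concatenate([np.ones(per_class), -np.ones(per_class)])
    # Standardize every dimension to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, y

X, y = make_g241c(np.random.default_rng(0))
```

The released data set also has its points shuffled, as described earlier in this section; that step is omitted here.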
Figure 21.1 Two-dimensional projections of g241c (left) and g241d (right). Black
circles, class +1; gray crosses, class −1. In both panels, the horizontal axis is the
direction that separates the class centers and the vertical axis is the first PCA
direction of the remainder.
g241d This data set was constructed to have potentially misleading cluster
structure, and no manifold structure. First 375 points were drawn from each of
two unit-variance isotropic Gaussians, the centers of which have a distance of 6
in a random direction; these points form the class +1. Then the centers of two
further Gaussians for class −1 were fixed by moving from each of the former
centers a distance of 2.5 in a random direction. Again, the identity matrix was
used as covariance matrix, and 375 points were sampled from each new Gaussian.
A two-dimensional projection of the resulting data is shown on the right side of
figure 21.1.
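A corresponding sketch for g241d follows. Again this only illustrates the description above: placing the first center at the origin and drawing the two shift directions independently are assumptions, and no standardization is applied because none is mentioned for this data set.

```python
import numpy as np

def random_unit_vector(rng, dim):
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def make_g241d(rng, dim=241, per_gaussian=375):
    """Four unit-variance Gaussians: two for class +1, two shifted copies for class -1."""
    # Class +1: two centers at distance 6 in a random direction.
    c1 = np.zeros(dim)                                   # assumed: first center at the origin
    c2 = c1 + 6.0 * random_unit_vector(rng, dim)
    # Class -1: move from each positive center by 2.5 in a random direction
    # (independent directions for the two shifts are an assumption).
    c3 = c1 + 2.5 * random_unit_vector(rng, dim)
    c4 = c2 + 2.5 * random_unit_vector(rng, dim)
    centers = np.stack([c1, c2, c3, c4])
    X = np.vstack([rng.standard_normal((per_gaussian, dim)) + c for c in centers])
    y = np.repeat([1, 1, -1, -1], per_gaussian)
    return X, y, centers

X, y, centers = make_g241d(np.random.default_rng(0))
```

Each negative Gaussian sits much closer to one positive Gaussian than the two positive Gaussians sit to each other, which is what makes the cluster structure potentially misleading.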
Digit1 This data set was designed to consist of points close to a low-dimensional
manifold embedded into a high-dimensional space, but not to show a pronounced
cluster structure. We therefore started from a system that generates artificial
writings (images) of the digit “1” developed by Matthias Hein (Hein and Audibert,
2005). The images are constructed starting from an abstract “1” implemented
as a function $[0, 1]^2 \to \{0, 1\}$, with the main vertical line ranging from y = 0.2
to y = 0.8 at x = 0.5. There are five degrees of freedom in this function: two
for translations ([−0.13, +0.13] each), one for rotation ([−90°, +90°]), one for line
thickness ([0.02, 0.05]), and one for the length of a small line at the bottom ([0, 0.1]).
The resulting function is then discretized to an image of size 16×16. As an example,
the first data point is shown in figure 21.2 (left).
Figure 21.2 First data point from Digit1 data set. (Left) Original image. (Right) After
rescaling, adding noise, and masking dimensions (x).
We randomly sampled 1500 such images. The class label was set according to the
tilt angle, with the boundary corresponding to an upright digit. To make the task
a bit more difficult, we apply a sequence of transformations to the data as shown
in algorithm 21.1, with σ set to 0.05. The result of this transformation (except for
bias and permutation) applied to the first data point is shown in the right part of
figure 21.2.
Since the data lie close to a five-dimensional manifold, SSL methods based on the
manifold assumption are expected to improve substantially on supervised learning.
USPS We derived a benchmark data set from the famous USPS set of handwrit-
ten digits as follows. We randomly drew 150 images of each of the ten digits. The
digits “2” and “5” were assigned to the class +1, and all the others formed class
−1. The classes are thus imbalanced with relative sizes of 1:4. We also expect both
the cluster assumption and the manifold assumption to hold.
To prevent people from realizing the origin of this benchmark data set and
exploiting its known structure (e.g. the spatial relationship of features in the image),
we again obscured the data by application of algorithm 21.1, this time with σ = 0.1.
Figure 21.3 illustrates the impact.
Figure 21.3 Fourth data point from the USPS data set. (Left) Original image. (Right)
After rescaling, adding noise, and masking dimensions (x).
COIL The Columbia object image library (COIL-100) is a set of color images
of 100 different objects taken from different angles (in steps of 5 degrees) at a
resolution of 128 × 128 pixels (Nene et al., 1996).1 To create our data set, we first
downsampled the red channel of each image to 16 × 16 pixels by averaging over
blocks of 8 × 8 pixels. We then randomly selected 24 of the 100 objects (with
24 ∗ 360/5 = 1728 images). The set of 24 objects was partitioned into six classes of
four objects each. We then randomly discarded 38 images of each class, to leave 250
1. at http://www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html
each. Finally, we applied algorithm 21.1 (with σ = 2) to hide the image structure
from the benchmark participants. Figure 21.4 gives an illustration.
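The downsampling step and the image counts can be checked with a short NumPy sketch. It is an illustration only: a random array stands in for a COIL image, and the assumption that the red channel is at index 0 of the last axis is ours.

```python
import numpy as np

def block_average(channel, block=8):
    """Downsample a 2-D array by averaging non-overlapping block x block tiles."""
    h, w = channel.shape
    assert h % block == 0 and w % block == 0
    # Axis 1 and axis 3 index the pixels inside each tile.
    tiles = channel.reshape(h // block, block, w // block, block)
    return tiles.mean(axis=(1, 3))

rng = np.random.default_rng(0)
# Synthetic stand-in for one 128 x 128 color image (red assumed at index 0).
image = rng.integers(0, 256, size=(128, 128, 3)).astype(float)
red_small = block_average(image[:, :, 0])      # 16 x 16 result

views_per_object = 360 // 5                     # one image every 5 degrees
total_images = 24 * views_per_object            # 24 selected objects
images_per_class = 4 * views_per_object - 38    # four objects per class, 38 discarded
```

With these numbers, `total_images` is 1728 and `images_per_class` is 250, matching the counts in the text.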
Figure 21.4 First data point from the COIL data set. (Left) Original image. (Right)
After rescaling, adding noise, and masking dimensions (x).
BCI This data set originates from research toward the development of a brain
computer interface (BCI) (Lal et al., 2004). A single person (subject C) performed
400 trials in each of which he imagined movements with either the left hand (class
-1) or the right hand (class +1). In each trial, EEG (electroencephalography) was
recorded from 39 electrodes. An autoregressive model of order 3 was fitted to each
of the resulting 39 time series. The trial was represented by the total of 117 = 39 × 3
fitted parameters. We thank Navin Lal for providing these data.
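To make the feature construction concrete, the following sketch fits an order-3 autoregressive model to each channel and concatenates the coefficients. The text does not say how the AR models were fitted; ordinary least squares is used here as an assumption, the function names are hypothetical, and random noise stands in for the EEG recording.

```python
import numpy as np

def fit_ar(series, order=3):
    """Least-squares AR fit: x[t] is regressed on x[t-1], ..., x[t-order]."""
    x = np.asarray(series, dtype=float)
    # Column k holds the series lagged by k + 1 steps.
    lagged = np.column_stack(
        [x[order - 1 - k : len(x) - 1 - k] for k in range(order)]
    )
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coeffs

def trial_features(eeg, order=3):
    """eeg has shape (channels, time); returns channels * order coefficients."""
    return np.concatenate([fit_ar(channel, order) for channel in eeg])

rng = np.random.default_rng(0)
eeg = rng.standard_normal((39, 500))   # synthetic stand-in for one trial
features = trial_features(eeg)          # 39 channels x 3 coefficients
```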
Text This data set consists of the five comp.* groups from the Newsgroups data set, and the goal is to
classify the ibm category versus the rest (Tong and Koller, 2001). We are thankful to
Simon Tong for providing this data set. A tf-idf (term frequency – inverse document
frequency) encoding resulted in a sparse representation with 11,960 dimensions.
For the benchmark, 750 points of each class have been randomly selected and the
features randomly permuted.
SecStr The main purpose of this benchmark data set is to investigate how far
current methods can cope with large-scale application. The task is to predict the
secondary structure of a given amino acid in a protein based on a sequence window
centered around that amino acid. Our data set is based on the CB513 set, 2 which
was created by Cuff and Barton and consists of 513 proteins (Cuff and Barton,
1999). The 513 proteins consist of a total of 84,119 amino acids, of which 440 were
X, Z, or B, and were therefore not considered.
2. e.g. at http://www.compbio.dundee.ac.uk/~www-jpred/data/pred_res/
For the remaining 83,679 amino acids, a symmetric sequence window of amino
acids [-7,+7] was used to generate the input x. Positions before the beginning or
after the end of the protein are represented by a special (21st) letter. Each letter
is represented by a sparse binary vector of length 21 such that the position of
the single 1 indicates the letter. The 28,968 α-helical and 18,888 β-sheet protein
positions were collectively called class -1, while the 35,823 remaining points (“coil”)
formed class +1.
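The window encoding can be sketched as follows. This is a minimal illustration under stated assumptions: the ordering of the 20 amino acid letters, the padding symbol `-` for the 21st letter, and the example sequence are hypothetical, and the sketch does not address how X, Z, or B are handled when they fall inside a window, since the text does not say.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard one-letter codes
PAD = "-"                               # hypothetical symbol for the 21st letter
ALPHABET = AMINO_ACIDS + PAD
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}

def encode_window(sequence, center, half_width=7):
    """One-hot encode the window [center - 7, center + 7] of a protein."""
    window = []
    for pos in range(center - half_width, center + half_width + 1):
        # Positions outside the protein are mapped to the padding letter.
        letter = sequence[pos] if 0 <= pos < len(sequence) else PAD
        window.append(INDEX[letter])
    x = np.zeros((len(window), len(ALPHABET)), dtype=np.uint8)
    x[np.arange(len(window)), window] = 1   # a single 1 per position
    return x.ravel()

x = encode_window("MKTAYIAKQR", center=0)
```

Each input therefore has 15 × 21 = 315 binary dimensions, of which exactly 15 are nonzero.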
We supplied another 1,189,472 unlabeled data points. However, none of the
benchmark participants chose to utilize these data in their experiments.
We decided to carry out the experiments in the transductive setting (cf. chapter 1):
the test set coincides with the set of unlabeled points. First, this is most economical
in terms of the required number of data points. Second, this imposes the fewest
requirements on participating methods. Otherwise it would have been necessary
to develop and implement “out-of-sample extensions” (e.g. Bengio et al. (2004b))
for the inherently transductive algorithms. We expect the prediction accuracy on
the unlabeled points to be similar to that achieved on out-of-sample points (after
having trained on the same sets of labeled and unlabeled points). Recall, however,
that transductive methods have to be retrained for every new set of test data, which
may be prohibitive in some practical applications. On the other hand, the retraining
offers the potential to learn from an increasing amount of unlabeled data, namely
the accumulated test points. This potential is wasted when an inductive classifier
is trained only once and from there on used.
An important question is how many labeled points are required to achieve decent
classification accuracy. To shed some light on this, we equipped the benchmark data
sets with subsets of labeled points of different sizes. More precisely, the number of
labeled points is either 10 or 100 for all data sets except for SecStr, for which it
is 100, 1000, or 10,000. In order to make the accuracy estimates derived from the
experiments robust and independent of coincidental properties of the chosen points,
we devised twelve subsets for each combination of data set and number of labeled
points (ten for data set SecStr). When choosing the subsets of labeled points, we
took care to pick at least one point from each class.
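One simple way to produce such splits is rejection sampling, sketched below. The text does not describe the actual sampling procedure, so the rejection step, the seed, and the function name are assumptions; the toy labels merely mimic a 1:4 class imbalance.

```python
import numpy as np

def draw_labeled_subsets(y, n_labeled, n_subsets=12, seed=0):
    """Draw index subsets of size n_labeled that contain every class at least once."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    subsets = []
    while len(subsets) < n_subsets:
        idx = rng.choice(len(y), size=n_labeled, replace=False)
        if np.isin(classes, y[idx]).all():   # reject draws that miss a class
            subsets.append(np.sort(idx))
    return subsets

y = np.repeat([1, -1], [300, 1200])          # toy labels, 1:4 imbalance
splits = draw_labeled_subsets(y, n_labeled=10)
# Transductive setting: the test set is the set of unlabeled points.
unlabeled = np.setdiff1d(np.arange(len(y)), splits[0])
```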
Since the unions of the sets of labeled points already cover substantial parts
of the entire data sets, we provided the labels of all points to the participants of
the benchmarks. This allowed for finding hyperparameter values by minimizing the
test error, which is not possible in real applications; however, the results of this
procedure can be useful to judge the potential of a method. To obtain results that
are indicative of real world performance, the model selection has to be performed
using only the small set of labeled points.
21.2 Application of SSL Methods 383
Table 21.2 TSVM results. For the linear kernel the algorithm described in chapter 6
has been used, for the nonlinear kernel the one of (Chapelle and Zien, 2005).
A major problem in the application of SSL methods to problems with very few
labeled data points is the model selection. In the following we describe for each
method how this was approached. Unless mentioned otherwise, the experiments
have been conducted by the authors of the corresponding chapters.
Several authors have provided results corresponding to different variations of
their algorithm. In order to keep the final results table as concise as possible, we
have in these cases compared them and preselected the best one.
Finally, all the results reported in the tables below are test errors in %.
Thorsten Joachims has reported results for the transductive support vector machine
(TSVM) algorithm described in chapter 6 and for the spectral graph transducer
(SGT) (Joachims, 2003). He used the code available on his webpage.
No model selection or parameter tuning has been performed, and according
to Joachims the results are likely to be improved by appropriate preprocessing
and/or model selection. For TSVM, a linear kernel was used and C was fixed to
$C^{-1} = \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2$. For the SGT algorithm, the hyperparameters were set as in
(Joachims, 2003): C = 3200, d = 80, and k = 100.
Since we believe that on some data sets nonlinearity might be important, we ran
our own implementation of TSVM (Chapelle and Zien, 2005) with a radial basis
function (RBF) kernel. Its width was chosen as the median of the pairwise distances
and C was fixed to 100. Results are presented in table 21.2. The tables at the end
of this chapter will refer to the nonlinear version.
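The two default settings mentioned above, C for the linear TSVM and the RBF width for the nonlinear one, are simple data statistics. The sketch below computes both with NumPy; it is an illustration, not the code used in the experiments, and the function names are hypothetical.

```python
import numpy as np

def default_C(X):
    """Return C such that 1 / C equals the mean squared norm of the inputs."""
    return 1.0 / np.mean(np.sum(X ** 2, axis=1))

def median_pairwise_distance(X):
    """Median Euclidean distance over all pairs i < j."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    upper = np.triu_indices(len(X), k=1)
    # Clip tiny negative values caused by rounding before taking the root.
    return np.median(np.sqrt(np.maximum(sq_dists[upper], 0.0)))

X_demo = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
C_demo = default_C(X_demo)                       # 1 / ((0 + 25 + 100) / 3)
width_demo = median_pairwise_distance(X_demo)    # distances are 5, 10, 5
```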
The method described in chapter 9 can be kernelized, but the experiments have been
reported using a linear classifier. The hyperparameters (λ and weight decay) have
been chosen by cross-validation. In case of a tie the smaller λ and the larger weight
decay have been selected. Since the algorithm is similar to TSVM (cf. section 21.2.1)
Table 21.3 Performances of the entropy regularization method (cf. chapter 9). Because
of the links with TSVM and the use of a linear classifier, the comparison with linear TSVM
(see table 21.2) is relevant.
and a linear class of functions has been used, we decided to compare the results with
those of the linear TSVM. The comparison is shown in table 21.3.
The experiments were run using the distributed propagation data-dependent reg-
ularization, which is applicable to both relational data and data derived from a
metric. The most important modeling decision in applying data-dependent regu-
larization is the selection of the regions that bias label similarity. In the absence
of domain knowledge, k-nearest neighbor regions, centered at each data point as
induced by the default Euclidean distance metric, were considered.
In order to determine the number of points in each region tenfold cross-validation
experiments were run. For this purpose, points that were graph-disconnected from
training data were always treated as errors; this encouraged selecting a k that makes
the information regularization graph connected.
The weight of labeled training data against unlabeled data, λ, was set to 0,
meaning that the posterior labels of training data were not allowed to change
from their given values. The regularization iteration proceeded until the change
in parameters became insignificant.
As a result of data-dependent regularization, each previously unlabeled point
now had a probabilistic class label. This probabilistic class label was converted to a
real label by thresholding the probability. The threshold was applied as an additive
term to the log probability of each class. Then the class assigned by the classifier
was determined by maximizing the threshold-adjusted (log) probability.
Proper selection of the threshold requires cross-validation. However, for computa-
tional efficiency reasons the authors cross-validated only between two scenarios: the
first, in which the threshold applied to each class is 0, which corresponds to treating
the output of information regularization as plain probabilities; and the second, in
which the threshold of each run is optimized so that the resulting class frequency
on the unlabeled data matches the empirical class frequency on the labeled ob-
servations. Data sets 1, 3, 4, and 6 preferred the first algorithm for selecting the
threshold (that is, no threshold), while data sets 2, 5, and 7 preferred the second
algorithm.
Table 21.4 Influence of the class mass normalization (CMN, cf. chapter 11)
A fully connected graph with an RBF kernel has been chosen for the algorithm
described in chapter 11. More precisely, the cost function (11.11) is minimized
giving the closed-form solution (11.12). The kernel bandwidth σ was selected in the
following way:
For data sets with 100 labeled examples, by cross-validation on the first split, the
same σ being used on all other splits
For data sets with 10 labeled examples, with the following basic heuristic: σ = d/3,
where d is the estimated average distance between a point in the data set and its
10th nearest neighbor
The tradeoff coefficient μ was set to $10^{-6}$.
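The bandwidth heuristic for ten labeled examples can be sketched as follows. This is a minimal illustration that computes d exactly over all points, whereas the text speaks of an estimated average; the function name is hypothetical.

```python
import numpy as np

def bandwidth_heuristic(X, k=10):
    """sigma = d / 3, with d the mean distance from a point to its k-th nearest neighbor."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    dists = np.sqrt(np.maximum(sq_dists, 0.0))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return kth.mean() / 3.0

X_line = np.arange(21.0)[:, None]   # 21 equally spaced points on a line
sigma = bandwidth_heuristic(X_line, k=10)
```

For the toy data above, the 10th-nearest-neighbor distances sum to 135 over the 21 points, so d = 135/21 and σ = 45/21.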
As shown in table 21.4, the class mass normalization (cf. section 11.5) seemed to
be very important, and later results are reported using this technique.
$$\tilde{K}(x, z) = K(x, z) - k_x^\top (I + r L^p G)^{-1} L^p k_z,$$
where $K(x, z)$ is a base kernel, $[k_x]_i = K(x_i, x)$, $G$ is the Gram matrix (of size
$l + u$), $L$ is the normalized graph Laplacian, and $r$ is the ratio $\gamma_I / \gamma_A$.
The base kernel was chosen to be an RBF with width σ. L is computed from
an adjacency matrix W corresponding to a weighted k nearest neighbors graph
with weights $W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_G}\right)$ if there is an edge between $x_i$ and $x_j$, and
0 otherwise. The width $\sigma_G$ is fixed as the mean distance between adjacent nodes
on this graph. The adjacency matrix is symmetrized by setting $W_{ij} = W_{ji}$ for
any non-zero edge weight $W_{ji}$. The normalized graph Laplacian is computed as
$L = I - D^{-1/2} W D^{-1/2}$, where D is a diagonal degree matrix given by $D_{ii} = \sum_j W_{ij}$.
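The graph construction just described can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' code. The denominator of the edge weight is taken as 2σ_G, following the formula as it appears in the text; whether the width should be squared there cannot be confirmed from this passage, so treat that detail as an assumption. The function name and the random demo data are hypothetical.

```python
import numpy as np

def knn_normalized_laplacian(X, k=5):
    """Normalized Laplacian of a symmetrized, weighted k-nearest-neighbor graph."""
    n = len(X)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(
        sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0
    )
    np.fill_diagonal(sq_dists, np.inf)              # exclude self-edges
    neighbors = np.argsort(sq_dists, axis=1)[:, :k]
    adjacency = np.zeros((n, n), dtype=bool)
    adjacency[np.repeat(np.arange(n), k), neighbors.ravel()] = True
    adjacency |= adjacency.T                         # symmetrize the edge set
    # Width: mean distance between adjacent nodes on the graph.
    sigma_g = np.sqrt(sq_dists[adjacency]).mean()
    # Assumption: denominator 2 * sigma_g, as the formula reads in the text.
    W = np.where(adjacency, np.exp(-sq_dists / (2.0 * sigma_g)), 0.0)
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(n) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]
    return L, W

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((60, 4))
L, W = knn_normalized_laplacian(X_demo, k=5)
L_p = np.linalg.matrix_power(L, 2)                   # the power L^p with p = 2
```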
For all data sets except Text, k = 5, p = 2 was used. For Text, those values
are k = 50, p = 5. This is based on the experimental experience of the authors:
relatively smaller values of k and p tend to work well for image data sets and larger
values are useful for textual data sets. No further optimization on these parameters
was attempted.
For the multiclass data set, a one-vs.-the-rest strategy was used. For each of
the classifiers, the bias b was selected such that a sixth of the unlabeled data was
classified in the positive class (because of a prior on uniform class probabilities for
the six classes).
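The bias selection for each one-vs.-the-rest classifier amounts to placing a threshold at a quantile of the unlabeled outputs. A minimal sketch follows, assuming the classifier predicts the positive class when `score + b > 0`; the function name and the toy scores are hypothetical.

```python
import numpy as np

def bias_for_positive_fraction(scores, fraction):
    """Return b such that roughly `fraction` of (scores + b) are positive."""
    threshold = np.quantile(scores, 1.0 - fraction)
    return -threshold

scores = np.arange(12.0)                         # toy classifier outputs
b = bias_for_positive_fraction(scores, 1.0 / 6.0)
n_positive = int(np.sum(scores + b > 0))         # two of the twelve points
```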
The hyperparameters were chosen by performing a search over the following grid:
1. regularization parameter $\gamma_A \in \{10^{-6}, 10^{-4}, 10^{-2}, 1, 100\}$;
2. base kernel width $\sigma \in \{\sigma_0/8, \sigma_0/4, \sigma_0/2, \sigma_0, 2\sigma_0, 4\sigma_0, 8\sigma_0\}$, where $\sigma_0$ is the mean norm
of the feature vectors in the data set;
3. ratio $r = \gamma_I / \gamma_A \in \{0, 10^{-4}, 10^{-2}, 1, 100, 10^4, 10^6\}$.
For data sets COIL and SecStr, the best mean test error across splits was
reported. For other data sets, the model selection criterion used was either
fivefold cross-validation error for 100 labeled points, or
for 10 labeled points, the normalized cut, $\frac{y^\top L^p y}{|\{i : y_i = 1\}| \, |\{i : y_i = -1\}|}$, where $y$ is the vector
of predicted labels.
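The normalized cut criterion used with ten labeled points is cheap to evaluate once the predicted labels are available. The sketch below is illustrative only: the function name is hypothetical, returning infinity for a one-class prediction is our convention, and a small path-graph Laplacian stands in for $L^p$.

```python
import numpy as np

def normalized_cut(y_pred, L_p):
    """y^T L^p y divided by the product of the two predicted class sizes."""
    y_pred = np.asarray(y_pred, dtype=float)
    n_pos = int(np.sum(y_pred == 1))
    n_neg = int(np.sum(y_pred == -1))
    if n_pos == 0 or n_neg == 0:
        return float("inf")   # convention here: a one-class labeling is never selected
    return float(y_pred @ L_p @ y_pred) / (n_pos * n_neg)

# Toy stand-in for L^p: the Laplacian of a path graph on three nodes.
L_toy = np.array([[1.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 1.0]])
score = normalized_cut([1, 1, -1], L_toy)   # one cut edge: 4 / (2 * 1)
```

Among candidate hyperparameter settings, the one whose predictions give the smallest value would be preferred, since a small value means the predicted labels vary little along graph edges while keeping both classes populated.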
Data set 8 was treated differently due to its size. The linear Laplacian support
vector machine/regularized least squares (SVM/RLS) was run as described in
section 12.5 and (Keerthi and DeCoste, 2005). The values k = 5, p = 4 were
set based on a crude search. Efficient nonlinear methods are currently under
development and may possibly return better performance on this data set.
It is important to note that Laplacian SVM/RLS also provides out-of-sample
prediction on completely unseen test points. Experimental results on data set
Digit1 are provided in chapter 12. Results are presented in table 21.5. Since
LapRLS achieved slightly better performances, we consider this method for the
table at the end of this chapter.
Table 21.5 Semi-supervised kernel (chapter 12). No MS stands for “no model selection”:
this is the best mean test error achieved across all hyperparameter values.
Table 21.6 First line: number of components kept in the dimensionality reduction;
second line: “true” manifold dimension; third line: estimate of the manifold dimension
according to the method described in (Hein and Audibert, 2005). 3
The large-scale methods described in chapter 18 use a small set of size m on which
to expand the solution. m was fixed to 100, except for the large-scale data set
SecStr where m was set to 1000.
The length scale σ was selected as explained in section 21.2.4, except that for
ten labeled points, the distance d used in the heuristic σ = d/3 is calculated as the
average distance between a point and its 10th nearest neighbor among m + 10 other
points randomly selected.
This method is not described in the book, but in (Chapelle and Zien, 2005). The
code used to run the experiments is available at
http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/lds/. The hyperparameter ρ is found by cross-validation,
the other hyperparameters being fixed to their default values. The reason for
not optimizing on more hyperparameters is that the model selection becomes
unreliable, especially with only ten labeled points. Note that if the number k of
nearest neighbors is optimized on the test error, the test error can be dramatically
decreased: for instance, on Digit1 with ten labeled points, a test error of 3.7% was
achieved with k = 5. This has to be compared to the 15.6% achieved by cross-
validation on ρ only.
21.2.12 Boosting
Ayhan Demiriz ran experiments on data set SecStr using the Assemble algorithm
(Bennett et al., 2002), which is a modified version of AdaBoost for semi-supervised
learning. It turns out that the algorithm was not very well suited for a very small
number of labeled points, as the algorithm stops whenever a weak learner correctly
classifies all labeled points. On the other hand, it seems much better suited for large
data sets, because the run time increases only linearly in the number of labeled and
unlabeled points.
The weak learner was a two-level decision tree. AdaBoost and Assemble have both
been run for 50 iterations. In this case, it seems that semi-supervised learning was
not helpful: AdaBoost achieved 30.8% test error, while Assemble achieved 32.2%.
Table 21.9 Test errors (%) with 10 labeled training points. Values printed in italics
were obtained by performing model selection w.r.t. the test error.
Table 21.10 ROC scores (area under curve; %) with 10 labeled training points.
Table 21.11 Test errors (%) with 100 labeled training points. Values printed in italics
were obtained by performing model selection w.r.t. the test error.
Table 21.12 ROC scores (area under curve; %) with 100 labeled training points.
is uniformly better than the others, and that for a given semi-supervised learning
problem, the algorithm needs to be selected carefully as a function of the nature
of the data set. A general rule (which seems obvious a posteriori) is that manifold-
based algorithms should be used for manifold-like data sets, and cluster-based
algorithms should be used for cluster-like data sets.
It should also be noted that model selection was challenging for most of the
competitors, especially in the case of only 10 labeled points, where the use of cross-
validation can be unreliable. In this respect, the results with 100 labeled points are
expected to be more reliable and to give a better indication of the strength of the
different algorithms.
One of the disappointing results of this benchmark is the Text data set. Indeed,
it has been shown that semi-supervised learning can be very useful for this type
21.3 Results and Discussion 393
Table 21.13 Results for SecStr for different numbers of labeled points. (Left) Test error
(%). (Right) ROC score (%). Values printed in italics were obtained by performing model
selection w.r.t. the test error.
of data (Joachims, 1999; Nigam et al., 2000; Chapelle and Zien, 2005), but the
results from tables 21.9 and 21.11 exhibit only a moderate improvement over plain
supervised learning. The fact that the data set has been constructed in a one-vs-rest
setting could be a possible explanation (cf. section 21.1.1). To test this hypothesis
we tried to classify only two topics, namely ibm and x. A linear SVM achieved a
mean test error of 12% (over several subsets of 100 labeled points), while a linear
TSVM was able to reduce the test error to 2%. Further investigation is required to
understand why such a large improvement is possible in this case.
Finally, it is worth pointing out that one should not necessarily expect an
improvement with unlabeled data. The data sets BCI and SecStr seem to be
examples where it is difficult to do better than standard supervised learning.
At least for SecStr, this might be a problem of the amounts of unlabeled data
that are utilized. Current approaches to protein secondary structure prediction use
essentially all known protein sequences, which amount to tens or hundreds of millions of
unlabeled data points. This is only possible due to the use of a very simple strategy:
roughly speaking, each protein is represented by an average of the proteins in its
neighborhood (Rost and Sander, 1993). Clearly, bringing the more sophisticated
(and probably more powerful) SSL methods to this scale is an important open
problem.
In all cases, we believe that there is no “black box” solution and that a good
understanding of the nature of the data is required to perform successful semi-
supervised learning. Indeed, in supervised learning, it seems that a good generic
learning algorithm can perform well on a lot of real-world data sets without specific
domain knowledge. In contrast, semi-supervised learning is possible only due to
the special form of the data distribution that correlates the label of a data point
with its situation within the distribution; therefore it seems much more difficult
to design a general semi-supervised classifier. Instead, powerful semi-supervised
learning algorithms distinguish themselves through the ability to make use of
available prior knowledge about the domain and data distribution, in order to relate
data and labels and improve classification. 4
4. Part of this paragraph has been inspired by comments from Adrian Corduneanu.
VI Perspectives
22 An Augmented PAC Model for Semi-
Supervised Learning
The standard PAC (probably approximately correct) learning model has proven
to be a useful theoretical framework for thinking about the problem of supervised
learning. However, it does not tend to capture the assumptions underlying many
semi-supervised learning methods. In this chapter we describe an augmented version
of the PAC model designed with semi-supervised learning in mind, that can be used
to help think about the problem of learning from labeled and unlabeled data and
many of the different approaches taken. The model provides a unified framework
for analyzing when and why unlabeled data can help, in which one can discuss both
sample-complexity and algorithmic issues.
Our model can be viewed as an extension of the standard PAC model, where in
addition to a concept class C, one also proposes a compatibility function: a type of
compatibility that one believes the target concept should have with the underlying
distribution of data. For example, it could be that one believes the target should
cut through a low-density region of space, or that it should be self-consistent in
some way, as in co-training. This belief is then explicitly represented in the model.
Unlabeled data are then potentially helpful in this setting because they allow one
to estimate compatibility over the space of hypotheses, and to reduce the size of
the search space from the whole set of hypotheses C down to those that, according
to one’s assumptions, are a priori reasonable with respect to the distribution.
After proposing the model, we then analyze sample-complexity issues in this
setting: that is, how much of each type of data one should expect to need in order
to learn well, and what are the basic quantities that these numbers depend on. We
provide examples of sample-complexity bounds both for uniform convergence and
ε-cover-based algorithms, as well as several algorithmic results.
22.1 Introduction
As we have already seen in the previous chapters, there has been growing interest
in using unlabeled data together with labeled data in machine learning, and a
number of different approaches have been developed. However, the assumptions
these methods are based on are often quite distinct and not captured by standard
theoretical models.
One difficulty from a theoretical point of view is that standard discriminative
learning models do not really capture how and why unlabeled data can be of
help. In particular, in the PAC model there is purposefully a complete disconnect
between the data distribution D and the target function f being learned (Valiant,
1984; Blumer et al., 1989; Kearns and Vazirani, 1994). The only prior belief is
that f belongs to some class C: even if D is known fully, any function f ∈ C
is still possible. For instance, it is perfectly natural (and common) to talk about
the problem of learning a concept class such as DNF (disjunctive normal form)
formulas (Linial et al., 1989; Verbeurgt, 1990) or an intersection of halfspaces
(Baum, 1990; Blum and Kannan, 1997; Vempala, 1997; Klivans et al., 2002) over
the uniform distribution; but clearly in this case unlabeled data are useless — you
can just generate the data yourself. For learning over an unknown distribution
(the standard PAC setting), unlabeled data can help somewhat, by allowing one to
use distribution-specific sample-complexity bounds, but this does not seem to fully
capture the power of unlabeled data in practice.
In generative-model settings, one can easily talk theoretically about the use
of unlabeled data, e.g., (Castelli and Cover, 1995, 1996). However, these results
typically make strong assumptions that essentially imply that there is only one
natural distinction to be made for a given (unlabeled) data distribution. For
instance, a typical generative-model setting would be that we assume positive
examples are generated by one Gaussian, and negative examples are generated by
another Gaussian. In this case, given enough unlabeled data, we could in principle
recover the Gaussians and would need labeled data only to tell us which Gaussian is
the positive one and which is the negative one.1 This is too strong an assumption for
most real-world settings. Instead, we would like our model to allow for a distribution
over data (e.g., documents we want to classify) where there are a number of plausible
distinctions we might want to make.2 In addition, we would like a general framework
that can be used to model many different uses of unlabeled data.
In this chapter, we present a PAC-style framework that bridges these positions
1. Castelli and Cover (1995, 1996) do not assume Gaussians in particular, but they do
assume the distributions are distinguishable, which from our perspective has the same
issue.
2. In fact, there has been recent work in the generative model setting on the practical side
that goes in this direction (see (Nigam et al., 2000; Nigam, 2001)). We discuss connections
to generative models further in section 22.5.2.
and which we believe can be used to help think about many of the ways unlabeled
data are typically used, including approaches discussed in other chapters. This
framework extends the PAC model in a way that allows one to express not only the
form of target function one is considering but also relationships that one hopes the
target function and underlying distribution will possess. We then analyze sample-
complexity issues in this setting: that is, how much of each type of data one should
expect to need in order to learn well, and also give examples of algorithmic results
in this model.
Specifically, the idea of the proposed model is to augment the PAC notion of a
concept class, which is a set of functions (like linear separators or decision trees),
with a notion of compatibility between a function and the data distribution that we
hope the target function will satisfy. Then, rather than talking of “learning a concept
class C,” we will talk of “learning a concept class C under compatibility notion χ”.
For example, suppose we believe there should exist a good linear separator, and that
furthermore, if the data happen to cluster, then this separator probably does not
slice through the middle of any such clusters. Then we would want a compatibility
notion that penalizes functions that do, in fact, slice through clusters. In this
framework, the extent to which unlabeled data help depends on two quantities:
first, the extent to which the true target function satisfies the given assumption,
and second, the extent to which the distribution allows this assumption to rule
out alternative hypotheses. For instance, if the data do not cluster at all, then all
functions equally satisfy this compatibility notion and the assumption ends up not
helping. From a Bayesian perspective, one can think of this as a PAC model for a
setting in which one’s prior is not just over functions, but also over how the function
and underlying distribution relate to each other.
To make our model formal, we will need to ensure that the degree of compatibility
be something that can be estimated from a finite sample. To do this, we will require
that the compatibility notion χ actually be a function from C × X to [0, 1], where
the compatibility of a function f with the data distribution D is $\mathbf{E}_{x \sim D}[\chi(f, x)]$.
The degree of incompatibility is then something we can think of as a kind of
“unlabeled error rate” that measures how a priori unreasonable we believe some
proposed hypothesis to be. For instance, in the example above of a “margin-style”
compatibility, we could define χ(f, x) to be an increasing function of the distance of
x to the separator f . In this case, the unlabeled error rate, 1 − χ(f, D), is a measure
of the probability mass close to the proposed separator. In co-training, where each
example x has two “views” ($x = \langle x_1, x_2 \rangle$), the underlying belief is that the true
target $c^*$ can be decomposed into functions $\langle c_1^*, c_2^* \rangle$ over each view such that for
most examples, $c_1^*(x_1) = c_2^*(x_2)$. In this case, we can define $\chi(\langle f_1, f_2 \rangle, \langle x_1, x_2 \rangle) = 1$
if $f_1(x_1) = f_2(x_2)$, and 0 if $f_1(x_1) \neq f_2(x_2)$. Then the compatibility of a hypothesis
$\langle f_1, f_2 \rangle$ with an underlying distribution D is $\Pr_{\langle x_1, x_2 \rangle \sim D}[f_1(x_1) = f_2(x_2)]$.
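Both compatibility notions from this paragraph are easy to estimate from an unlabeled sample, as the following sketch shows. It is an illustration under stated assumptions: the margin-style χ is taken to be min(1, distance/γ), which is just one possible increasing function of the distance, and the function names, the parameter γ, and the toy data are hypothetical.

```python
import numpy as np

def cotraining_compatibility(f1, f2, X1, X2):
    """Fraction of unlabeled two-view examples on which f1 and f2 agree."""
    return float(np.mean(f1(X1) == f2(X2)))

def margin_compatibility(w, b, X, gamma=1.0):
    """Average of chi(f, x) = min(1, dist(x, separator) / gamma) over the sample."""
    dist = np.abs(X @ w + b) / np.linalg.norm(w)
    return float(np.mean(np.minimum(1.0, dist / gamma)))

# Co-training: both hypotheses take the sign of the single feature of their view.
sign_of_first = lambda X: np.sign(X[:, 0])
X1 = np.array([[1.0], [-1.0], [2.0], [-2.0]])
X2 = np.array([[1.0], [1.0], [2.0], [-2.0]])
agreement = cotraining_compatibility(sign_of_first, sign_of_first, X1, X2)
unlabeled_error = 1.0 - agreement        # the two views disagree on one of four examples

# Margin: linear separator x_1 = 0 in the plane.
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[0.5, 0.0], [2.0, 0.0], [-0.25, 3.0]])
margin_chi = margin_compatibility(w, b, X, gamma=1.0)
```

In both cases the estimate is an average of a per-example quantity in [0, 1], which is what makes it estimable from a finite unlabeled sample.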
This setup allows us to analyze the ability of a finite unlabeled sample to reduce
our dependence on labeled examples, as a function of the compatibility of the target
function (i.e., how correct we were in our assumption) and various measures of the
“helpfulness” of the distribution. In particular, in our model, we find that unlabeled
400 An Augmented PAC Model for Semi-Supervised Learning
belongs to a given class C. For a given hypothesis f, the (true) error rate of f is
defined as $err(f) = err_D(f) = \Pr_{x \sim D}[f(x) \neq c^*(x)]$. For any two hypotheses
$f_1, f_2 \in C$, the distance with respect to D between $f_1$ and $f_2$ is defined as
$d(f_1, f_2) = d_D(f_1, f_2) = \Pr_{x \sim D}[f_1(x) \neq f_2(x)]$. We will use $\widehat{err}(f)$ to denote
the empirical error rate of f on a given labeled sample and $\hat{d}(f_1, f_2)$ to denote the
empirical distance between $f_1$ and $f_2$ on a given unlabeled sample.
We define a notion of compatibility to be a mapping from a hypothesis f and a
distribution D to [0, 1] indicating how “compatible” f is with D. In order for this to
be estimable from a finite sample, we require that compatibility be an expectation
over individual examples. (Though one could imagine more general notions with
this property as well.) Specifically, we define:
Definition 22.1 A legal notion of compatibility is a function χ : C × X → [0, 1]
where we (overloading notation) define $\chi(f, D) = \mathbf{E}_{x \sim D}[\chi(f, x)]$. Given a sample
S, we define χ(f, S) to be the empirical average over the sample.
Remark 22.2 One could also allow compatibility functions over k-tuples of exam-
ples, in which case our (unlabeled) sample-complexity bounds would simply increase
by a factor of k. For settings in which D is actually known in advance (e.g., trans-
ductive learning; see section 22.5.1) we can drop this requirement entirely and allow
any notion of compatibility χ(f, D) to be legal.
3. For more discussion regarding co-training see also chapter 2 in this book.
22.3 Sample Complexity Results 403
We now present several sample-complexity bounds that fall out of this framework,
showing how unlabeled data, together with a suitable compatibility notion, can
reduce the need for labeled examples.
The basic structure of all of these results is as follows. First, given enough unla-
beled data (where “enough” will be a function of some measure of the complexity
of C and possibly of χ as well), we can uniformly estimate the true compatibilities
of all functions in C by their empirical compatibilities over the sample. Then, by
using this quantity to give a preference ordering over the functions in C, we can
reduce “C” down to “the set of functions in C whose compatibility is not much
larger than the true target function” in bounds for the number of labeled examples
needed for learning. The specific bounds differ in terms of the exact complexity
measures used (and a few other issues such as stratification and realizability) and
we provide examples illustrating when certain complexity measures can be signifi-
cantly more powerful than others. In particular, ǫ-cover bounds (section 22.3.3) can
provide especially good bounds for co-training and graph-based settings.
We begin with uniform convergence bounds (later in section 22.3.3 we give tighter
ǫ-cover bounds that apply to algorithms of a particular form). For clarity, we begin
with the case of finite hypothesis spaces where we measure the “size” of a set
of functions by just the number of functions in the set. We then discuss several
issues that arise when considering infinite hypothesis spaces, such as what is an
appropriate measure for the “size” of the set of compatible functions, and the
need to account for the complexity of the compatibility notion itself. Note that in
the standard PAC model, one typically talks of either the realizable case, where we
assume that c∗ ∈ C, or the agnostic case where we do not (see (Kearns and Vazirani,
1994)). In our setting, we have the additional issue of unlabeled error rate, and can
either make an a priori assumption that the target function’s unlabeled error is low,
or else aim for a more “Occam-style” bound in which we have a stream of labeled
examples and halt once they are sufficient to justify the hypothesis produced.
We first give a bound for the “doubly realizable” case.
Proof The probability that a given hypothesis f with err_unl(f) > ε has
êrr_unl(f) = 0 is at most (1 − ε)^{m_u} < δ/(2|C|) for the given value of m_u.
Therefore, by the union bound, the number of unlabeled examples is sufficient to ensure
that with probability 1 − δ/2, only hypotheses in C_{D,χ}(ε) have êrr_unl(f) = 0. The
number of labeled examples then similarly ensures that with probability 1 − δ/2,
none of those whose true error is at least ε have an empirical error of 0, yielding
the theorem.
So, if the target function indeed is perfectly correct and compatible, then theo-
rem 22.5 gives sufficient conditions on the number of examples needed to ensure
that an algorithm that optimizes both quantities over the observed data will, in
fact, achieve a PAC guarantee. To emphasize this, we will say that an algorithm
efficiently PAC_unl-learns the pair (C, χ) if it is able to achieve a PAC guarantee
using time and sample sizes polynomial in the bounds of theorem 22.5.
We can think of theorem 22.5 as bounding the number of labeled examples we
need as a function of the “helpfulness” of the distribution D with respect to our
notion of compatibility. That is, in our context, a helpful distribution is one in
which C_{D,χ}(ε) is small, and so we do not need much labeled data to identify a good
function among them. We can get a similar bound in the situation when the target
function is not fully compatible:
Theorem 22.6 Given t ∈ [0, 1], if we see m_u unlabeled examples and m_l labeled
examples, where
\[
m_u \ge \frac{2}{\epsilon^2}\left[ \ln|C| + \ln\frac{4}{\delta} \right] \quad\text{and}\quad m_l \ge \frac{1}{\epsilon}\left[ \ln|C_{D,\chi}(t+2\epsilon)| + \ln\frac{2}{\delta} \right],
\]
then with probability at least 1 − δ, all f ∈ C with êrr(f) = 0 and êrr_unl(f) ≤ t + ε
have err(f) ≤ ε, and furthermore all f ∈ C with err_unl(f) ≤ t have êrr_unl(f) ≤ t + ε.
In particular, this implies that if err_unl(c∗) ≤ t and err(c∗) = 0 then with high
probability the f ∈ C that optimizes êrr(f) and êrr_unl(f) has err(f) ≤ ε.
Proof Same as theorem 22.5 except apply Hoeffding bounds (see Devroye et al.
(1996)) to the unlabeled error rates.
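To get a feel for the sample sizes in theorem 22.6, the two bounds can be evaluated numerically. The following is a minimal sketch for the finite-class case; the function names are hypothetical, and the class sizes passed in are made-up inputs rather than values from the chapter.

```python
import math


def unlabeled_bound(num_hypotheses, eps, delta):
    """m_u >= (2 / eps^2) * [ln|C| + ln(4 / delta)] from theorem 22.6."""
    return math.ceil((2.0 / eps ** 2) * (math.log(num_hypotheses) + math.log(4.0 / delta)))


def labeled_bound(num_compatible, eps, delta):
    """m_l >= (1 / eps) * [ln|C_{D,chi}(t + 2 eps)| + ln(2 / delta)] from theorem 22.6."""
    return math.ceil((1.0 / eps) * (math.log(num_compatible) + math.log(2.0 / delta)))


# Hypothetical sizes: |C| = 2^20 overall, but only 2^5 hypotheses are nearly compatible.
m_u = unlabeled_bound(2 ** 20, eps=0.1, delta=0.05)
m_l_compatible = labeled_bound(2 ** 5, eps=0.1, delta=0.05)
m_l_whole_class = labeled_bound(2 ** 20, eps=0.1, delta=0.05)
```

The comparison between `m_l_compatible` and `m_l_whole_class` shows the intended effect: the labeled sample size scales with the logarithm of the compatible set rather than of the whole class.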
Finally, we give a simple Occam/luckiness type of bound for this setting. Given
a sample S, let us define desc_S(f) = ln |C_{S,χ}(êrr_unl(f))|. That is, desc_S(f) is
the description length of f (in “nats”) if we sort hypotheses by their empirical
compatibility and output the index of f in this ordering. Similarly, define
ε-desc_D(f) = ln |C_{D,χ}(err_unl(f) + ε)|. This is an upper bound on the description
length of f if we sort hypotheses by an ε-approximation to their true compatibility.
Then we can get a bound as follows:
Theorem 22.7 For any set S of unlabeled data, given m_l labeled examples, with
probability at least 1 − δ, all f ∈ C satisfying êrr(f) = 0 and desc_S(f) ≤ ε m_l − ln(1/δ)
have err(f) ≤ ε. Furthermore, if |S| ≥ (2/ε²)[ln |C| + ln(2/δ)], then with probability at least
The point of this theorem is that an algorithm can use observable quantities to
determine if it can be confident. Furthermore, if we have enough unlabeled data,
the observable quantities will be no worse than if we were learning a slightly less
compatible function using an infinite-size unlabeled sample.
Note that if we begin with a non-distribution-dependent ordering of hypotheses,
inducing some description length desc(f ), and our compatibility assumptions turn
out to be wrong, then it could well be that descD (c∗ ) > desc(c∗ ). In this case our
use of unlabeled data would end up hurting rather than helping.
To reduce notation, we will assume in the rest of this chapter that χ(f, x) ∈ {0, 1}
so that χ(f, D) = Pr_{x∼D}[χ(f, x) = 1]. However, all our sample complexity results
can be easily extended to the general case.
For infinite hypothesis spaces, the first issue that arises is that in order to achieve
uniform convergence of unlabeled error rates, the set whose complexity we care
about is not C but rather χ(C) = {χ_f : f ∈ C} where we define χ_f(x) = χ(f, x). For
instance, suppose examples are just points on the line, and C = {f_a(x) : f_a(x) = 1
This is the analogue of theorem 22.6 for the infinite case. In particular, this implies
that if err(c∗) = 0 and err_unl(c∗) ≤ t, then with high probability the f ∈ C that
optimizes êrr(f) and êrr_unl(f) has err(f) ≤ ε.
Proof sketch: By standard VC bounds (Devroye et al., 1996; Vapnik, 1998), the
number of unlabeled examples is sufficient to ensure that with probability 1 − δ/2 we
can estimate, within ε, Pr_{x∼D}[χ_f(x) = 1] for all χ_f ∈ χ(C). Since χ_f(x) = χ(f, x),
this implies we can estimate, within ε, the unlabeled error rate err_unl(f) for all
f ∈ C, and so the set of hypotheses with êrr_unl(f) ≤ t + ε is contained in C_{D,χ}(t + 2ε).
The bound on the number of labeled examples follows from (Devroye et al., 1996)
(where it is shown that the expected number of partitions can be used instead of
the maximum in the standard VC proof). This bound ensures that with probability
1 − δ/2, none of the functions in C_{D,χ}(t + 2ε) whose true (labeled) error is at least
ε have an empirical (labeled) error of 0.
We can also give a bound where we specify the number of labeled examples as a
function of the unlabeled sample; this is useful because we can imagine our learning
algorithm performing some calculations over the unlabeled data and then deciding
how many labeled examples to purchase.
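\[
O\!\left( \frac{\max[\mathrm{VCdim}(C), \mathrm{VCdim}(\chi(C))]}{\epsilon^2} \log\frac{1}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{2}{\delta} \right)
\]
is sufficient so that with probability ≥ 1 − δ we have that simultaneously for every
k ≥ 0 the following is true: if we label m_k examples drawn uniformly at random
from S, where
\[
m_k > \frac{4}{\epsilon}\left[ \log(2s) + \log\frac{2(k+1)(k+2)}{\delta} \right] \quad\text{and}\quad s = C_{S,\chi}((k+1)\epsilon)[2m_k, S],
\]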
\[
m_{k,i} > \frac{8}{\epsilon^2}\left[ \log(2s) + \log\frac{4(k+1)(k+2)(i+1)(i+2)}{\delta} \right] \quad\text{and}\quad s = C_{S,\chi}((k+1)\epsilon)[2m_k, S],
\]
We can similarly derive tight bounds using Rademacher averages. For different
versions of our statements using recent stronger bounds (Boucheron et al., 2000,
2005), see (Balcan and Blum, 2005).
The bounds in the previous section are for uniform convergence: they provide
guarantees for any algorithm that optimizes well on the observed data. In this
section, we consider stronger bounds based on ǫ-covers that can be obtained for
algorithms that behave in a specific way: they first use the unlabeled examples to
choose a “representative” set of compatible hypotheses, and then use the labeled
sample to choose among these. Bounds based on ǫ-covers exist in the classical PAC
setting, but in our framework these bounds and algorithms of this type are especially
natural and convenient.
Recall that a set C_ε ⊆ 2^X is an ε-cover for C with respect to D if for every f ∈ C
there is an f′ ∈ C_ε which is ε-close to f. That is, Pr_{x∼D}(f(x) ≠ f′(x)) ≤ ε.
To illustrate how this can produce stronger bounds, consider the setting of
example 3 (graph-based algorithms) where the graph g consists of two cliques of
n/2 vertices, connected together by o(n²) edges (in particular, the number of edges
connecting the cliques is small compared to εn²). Suppose the target function labels
one of the cliques as positive and one as negative, and we define incompatibility of a
hypothesis to be the fraction of edges in g that are cut by it (so the target function
indeed has unlabeled error rate less than ε). Now, given any set S_L of m_l ≪ εn
Theorem 22.12 If t is an upper bound for err_unl(c∗) and p is the size of a
minimum ε-cover for C_{D,χ}(t + 4ε), then using m_u unlabeled examples and m_l labeled
examples for
\[
m_u = O\!\left( \frac{\mathrm{VCdim}(\chi(C))}{\epsilon^2} \log\frac{1}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{2}{\delta} \right) \quad\text{and}\quad m_l = O\!\left( \frac{1}{\epsilon} \ln\frac{p}{\delta} \right),
\]
we can with probability 1 − δ identify a hypothesis which is 10ε close to c∗.
Proof sketch: First, given the unlabeled sample SU , define Hǫ ⊆ C as follows: for
4. Effectively, ǫ-cover bounds allow one to rule out a hypothesis that, say, just separates
the positive points in SL from the rest of the graph by noting that this hypothesis is very
close (with respect to D) to the all-negative hypothesis, and that hypothesis has a high
labeled-error rate.
5. Proof: Let V be the set of all variables that (a) appear in every positive example of S_L
and (b) appear in no negative example of S_L. Over the draw of S_L, each variable has a
(1/2)^{2|S_L|} = 1/√d chance of belonging to V, so with high probability V has size at least
(1/2)√d. Now, consider the hypothesis corresponding to the conjunction of all variables in V.
This correctly classifies the examples in S_L, and w.h.p. it classifies every other example
in S_U negative because each example in S_U has only a 1/2^{|V|} chance of satisfying every
variable in V, and the size of S_U is much less than 2^{|V|}. So, this means it is compatible
with S_U and consistent with S_L, even though its true error is high.
2. Using unlabeled data, determine H_ε^{i+1} by crossing out from H_ε^i those hypotheses
f with the property that d̂(g_i, f) < 3ε.
3. If H_ε^{i+1} = ∅ then set s = i and stop; else, increase i by 1 and go to 1.
Our bound on mu is sufficient to ensure that, with probability ≥ 1 − δ/2, Hǫ is
an ǫ-cover of C, which implies that, with probability ≥ 1 − δ/2, Cǫ is an ǫ-cover for
CD,χ (t). It is then possible to show Gǫ is, with probability ≥ 1 − δ/2, a 5ǫ-cover for
CD,χ (t) of size at most p. The idea here is that by greedily creating a 3ǫ-cover of
Cǫ with respect to distribution SU , we are creating a 4ǫ-cover of Cǫ with respect to
D, which is a 5ǫ-cover of CD,χ (t) with respect to D. Furthermore, we are doing this
using no more functions than would a greedy 2ǫ-cover procedure for CD,χ (t + 4ǫ)
with respect to D, which is no more than the optimal ǫ-cover of CD,χ (t + 4ǫ).
Now to learn c∗ we use labeled data and we do empirical risk minimization on Gǫ .
By standard bounds (see, for instance, (Benedek and Itai, 1991)), the number of
labeled examples is enough to ensure that with probability ≥ 1 − δ/2 the empirical
optimum hypothesis in Gǫ has true error at most 10ǫ. This implies that overall,
with probability ≥ 1 − δ, we find a hypothesis of error at most 10ǫ.
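The two-phase procedure in this proof sketch (greedily thin out the compatible hypotheses using empirical distances on unlabeled data, then run empirical risk minimization over the survivors) can be sketched as follows. This is only an illustration under stated assumptions: hypotheses are plain Python callables, the representative g_i is taken to be the first remaining hypothesis (any choice would do for this sketch), and the threshold class and data are made up.

```python
def empirical_distance(f, g, unlabeled):
    """Fraction of unlabeled points on which f and g disagree."""
    return sum(f(x) != g(x) for x in unlabeled) / len(unlabeled)


def greedy_cover(candidates, unlabeled, eps):
    """Greedily pick representatives; drop everything within 3*eps of each pick."""
    remaining = list(candidates)
    cover = []
    while remaining:
        g = remaining[0]  # assumption: any remaining hypothesis may be chosen
        cover.append(g)
        remaining = [f for f in remaining
                     if empirical_distance(g, f, unlabeled) >= 3 * eps]
    return cover


def empirical_risk_minimizer(cover, labeled):
    """Return the cover element with the fewest mistakes on the labeled sample."""
    return min(cover, key=lambda f: sum(f(x) != y for x, y in labeled))


def threshold(th):
    """Hypothetical hypothesis class: thresholds on the integers 0..999."""
    return lambda x: int(x >= th)


unlabeled = list(range(1000))
candidates = [threshold(th) for th in range(0, 1001, 10)]
cover = greedy_cover(candidates, unlabeled, eps=0.05)
labeled = [(100, 0), (400, 0), (600, 1), (900, 1)]
best = empirical_risk_minimizer(cover, labeled)
```

In this toy run the cover is far smaller than the candidate set, so the labeled sample only has to distinguish among a handful of representatives.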
As an interesting case where unlabeled data help substantially, consider a co-
training setting where the target c∗ is fully compatible and D satisfies the con-
ditional independence given the label property. As shown by Blum and Mitchell
(1998), one can boost any weak hypothesis from unlabeled data in this setting (as-
suming one has enough labeled data to produce a weak hypothesis). Related sample
complexity results are given in (Dasgupta et al., 2001). We can actually show that
given enough unlabeled data, in fact we can learn from just a single labeled exam-
ple. Specifically, it is possible to show that for any concept classes C1 and C2 , we
have:
Theorem 22.13 Assume that err(c∗) = err_unl(c∗) = 0 and D satisfies independence
given the label. Then using m_u unlabeled examples and m_l labeled examples
we can find a hypothesis that with probability 1 − δ has error at most ε, provided
that
\[
m_u = O\!\left( \frac{1}{\epsilon} \cdot \left[ (\mathrm{VCdim}(C_1) + \mathrm{VCdim}(C_2)) \cdot \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta} \right] \right)
\]
and
\[
m_l = O\!\left( \log_{1/\epsilon} \frac{1}{\delta} \right).
\]
Proof sketch: For convenience we will show a bound with 6ε instead of ε, 3δ instead
of δ, and we will assume for simplicity the setting of example 3, where c∗ = c∗_1 = c∗_2
and also that D1 = D2 = D (the general case is handled similarly, but just requires
more notation). We first characterize the hypotheses with true unlabeled error rate
at most ε. Recall that χ(f, D) = Pr_{(x1,x2)∼D}[f(x1) = f(x2)], and for concreteness
assume f predicts using x1 if f(x1) ≠ f(x2). Consider f ∈ C with err_unl(f) ≤ ε and
let’s define p− = Pr_{x∼D}[c∗(x) = 0], p+ = Pr_{x∼D}[c∗(x) = 1] and for i, j ∈ {0, 1}
define p_ij = Pr_{x∼D}[f(x) = i, c∗(x) = j]. We clearly have err(f) = p10 + p01. From
err_unl(f) = Pr_{(x1,x2)∼D}[f(x1) ≠ f(x2)] ≤ ε, using the independence given the
label of D, we get
\[
\frac{2 p_{10} p_{00}}{p_-} + \frac{2 p_{01} p_{11}}{p_+} \le \epsilon.
\]
This implies that the almost compatible
hypothesis f must be one of the following four types:
1. f is “close to c∗ ” or more exactly err(f ) ≤ 2ǫ.
2. f is “close to the opposite of c∗ ” or more exactly err(f ) ≥ 1 − 2ǫ.
3. f “predicts almost always negative” or more exactly p10 + p11 ≤ 3ǫ.
4. f “predicts almost always positive” or more exactly p01 + p00 ≤ 3ǫ.
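The case analysis above can be checked numerically. The sketch below computes the unlabeled error rate from the joint probabilities p_ij under independence given the label and reports which of the four types a hypothesis falls into; the function names and the string labels are illustrative assumptions.

```python
def unlabeled_error(p00, p01, p10, p11):
    """err_unl(f) = 2*p10*p00/p_minus + 2*p01*p11/p_plus, with p_ij = Pr[f = i, c* = j]."""
    p_minus = p00 + p10  # Pr[c*(x) = 0]
    p_plus = p01 + p11   # Pr[c*(x) = 1]
    neg = 2 * p10 * p00 / p_minus if p_minus > 0 else 0.0
    pos = 2 * p01 * p11 / p_plus if p_plus > 0 else 0.0
    return neg + pos


def hypothesis_types(p00, p01, p10, p11, eps):
    """List every one of the four types (in the order of the text) that f satisfies."""
    err = p10 + p01
    types = []
    if err <= 2 * eps:
        types.append("close to c*")
    if err >= 1 - 2 * eps:
        types.append("close to the opposite of c*")
    if p10 + p11 <= 3 * eps:
        types.append("almost always negative")
    if p01 + p00 <= 3 * eps:
        types.append("almost always positive")
    return types


# Hypothetical hypothesis: balanced classes, 1% of each class mislabeled.
example_types = hypothesis_types(0.495, 0.005, 0.005, 0.495, eps=0.05)
```

For this example the unlabeled error rate is about 0.02, and the only type that applies is the first one.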
Now, consider f1 to be the constant positive function, f0 to be the constant negative
function. The unlabeled sample S_U is sufficient to ensure that with probability ≥ 1 − δ,
every hypothesis with zero estimated unlabeled error has true unlabeled error at
most ε. Therefore, by our previous analysis, there are only four kinds of hypotheses
consistent with unlabeled data: those close to c∗, those close to its complement
¬c∗, those close to f0, and those close to f1. Furthermore, c∗, ¬c∗, f0, and f1 are
compatible with the unlabeled data.
We now check if there exists a hypothesis g ∈ C with êrr_unl(g) = 0 such that
d̂(f1, g) ≥ 4ε and d̂(f0, g) ≥ 4ε. If such a hypothesis g exists, then we know that one of
{g, ¬g}, where ¬g is the opposite of g, is 2ε-close to c∗. If not, we must have p+ ≤ 6ε
or p− ≤ 6ε, in which case we know that one of {f0, f1} is 6ε-close to c∗. So, we
have a set of two functions, opposite to each other, one of which is at least 6ε-close
to c∗. We now use labeled data to pick one of these to output, using lemma 22.14
below.
Lemma 22.14 Consider ε < 1/8. Let C_ε = {f, ¬f} be a subset of C containing
two opposite hypotheses with the property that one of them is ε-close to c∗. Then,
m_l > 6 log_{1/ε}(1/δ) labeled examples are sufficient so that with probability ≥ 1 − δ,
the concept in C_ε that is ε-close to c∗ in fact has lower empirical error.
Proof Easy calculation: if m_l > 6 log_{1/ε}(1/δ), then
\[
\sum_{k=0}^{\lfloor m_l/2 \rfloor} \binom{m_l}{k} \epsilon^{(m_l-k)} (1-\epsilon)^k \le \delta.
\]
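The "easy calculation" in the proof of lemma 22.14 can be checked numerically for specific values. The following sketch evaluates the binomial sum directly; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def labeled_examples_needed(eps, delta):
    """An integer m_l just above 6 * log_{1/eps}(1/delta), up to floating-point rounding."""
    return math.floor(6 * math.log(1 / delta) / math.log(1 / eps)) + 1


def failure_probability_bound(m_l, eps):
    """sum_{k=0}^{floor(m_l/2)} C(m_l, k) * eps^(m_l - k) * (1 - eps)^k."""
    return sum(math.comb(m_l, k) * eps ** (m_l - k) * (1 - eps) ** k
               for k in range(m_l // 2 + 1))


# Hypothetical setting: eps = 0.05 (below 1/8) and delta = 0.01.
m_l = labeled_examples_needed(0.05, 0.01)
bound = failure_probability_bound(m_l, 0.05)
```

One way to see why the sum is small: each term has ε raised to a power of at least m_l/2, and the binomial coefficients sum to at most 2^{m_l}, so the whole sum is at most (4ε)^{m_l/2}, which is at most ε^{m_l/6} when ε ≤ 1/8.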
examples needed m_l to 1. In fact, this result can be extended to the case considered
in (Balcan et al., 2004), that D+ and D− merely satisfy constant expansion.
This example illustrates that if data are especially well behaved with respect to
the compatibility notion, then our bounds on labeled data can be extremely good.
In section 22.4.2, we show for the case of linear separators and independence given
the label, we can give efficient algorithms, achieving the bounds in theorem 22.13 in
terms of labeled examples by a polynomial time algorithm. Note, however, that both
these bounds rely heavily on the assumption that the target is fully compatible. If
the assumption is more of a “hope” than a belief, then one would need additional
labeled examples just to validate the hypothesis produced.
We give here a simple example to illustrate the bounds in section 22.3.1, and for
which we can give a polynomial-time algorithm that takes advantage of them. Let
the instance space X = {0, 1}^d, and for x ∈ X, let vars(x) be the set of variables
set to 1 by x. Let C be the class of monotone disjunctions (e.g., x1 ∨ x3 ∨ x6),
and for f ∈ C, let vars(f) be the set of variables disjoined by f. Now, suppose we
say an example x is compatible with function f if either vars(x) ⊆ vars(f) or else
vars(x) ∩ vars(f) = ∅. This is a very strong notion of “margin”: it says, in essence,
that every variable is either a positive indicator or a negative indicator, and no
example should contain both positive and negative indicators.
Given this setup, we can give a simple PAC_unl-learning algorithm for this pair
(C, χ). We begin by using our unlabeled data to construct a graph on d vertices (one
per variable), putting an edge between two vertices i and j if there is any example x
in our unlabeled sample with i, j ∈ vars(x). We now use our labeled data to label the
components. If the target function is fully compatible, then no component will get
multiple labels (if some component does get multiple labels, we halt with failure).
Finally, we produce the hypothesis f such that vars(f ) is the union of the positively
labeled components. This is fully compatible with the unlabeled data and has zero
error on the labeled data, so by theorem 22.5, if the sizes of the data sets are as
given in the bounds, with high probability the hypothesis produced will have error
≤ ǫ.
Notice that if we want to view the algorithm as “purchasing” labeled data, then
we can simply examine the graph, count the number of connected components k,
and then request (1/ε)[k ln 2 + ln(2/δ)] labeled examples. (Here, 2^k = |C_{S,χ}(0)|.) By the
proof of theorem 22.5, with high probability 2^k ≤ |C_{D,χ}(ε)|, so we are purchasing
no more than the number of labeled examples in the theorem statement.
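The graph-based algorithm for monotone disjunctions described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: examples are represented as sets of variable indices (that is, vars(x)), connected components are tracked with a union-find structure, components that receive no label default to negative, and all function names are hypothetical.

```python
import math


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def learn_disjunction(d, unlabeled, labeled):
    """Return (vars(f), number of components) for the compatible, consistent disjunction."""
    parent = list(range(d))
    for x in unlabeled:  # variables that co-occur in an example share a component
        xs = sorted(x)
        for v in xs[1:]:
            parent[find(parent, v)] = find(parent, xs[0])
    component_label = {}
    for x, y in labeled:  # each labeled example labels every component it touches
        for v in x:
            root = find(parent, v)
            if component_label.setdefault(root, y) != y:
                raise ValueError("a component received both labels; halting with failure")
    positive_vars = {v for v in range(d)
                     if component_label.get(find(parent, v)) == 1}
    num_components = len({find(parent, v) for v in range(d)})
    return positive_vars, num_components


def predict(positive_vars, x):
    """Evaluate the monotone disjunction over positive_vars on example x."""
    return int(bool(positive_vars & set(x)))


def labels_to_request(k, eps, delta):
    """(1/eps) * [k ln 2 + ln(2/delta)] labeled examples for k components."""
    return math.ceil((1.0 / eps) * (k * math.log(2) + math.log(2.0 / delta)))


# Hypothetical data over d = 6 variables; the target disjunction uses variables {0, 1, 2}.
unlabeled = [{0, 1}, {1, 2}, {3, 4}, {5}]
labeled = [({0}, 1), ({3}, 0)]
positive_vars, k = learn_disjunction(6, unlabeled, labeled)
```

In this toy run a single positive example on variable 0 is enough to label the whole component {0, 1, 2} positive, which is the sense in which a helpful distribution reduces the need for labels.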
Also, it is interesting to see the difference between a “helpful” and “nonhelpful”
22.4 Algorithmic Results 413
We now consider the case of co-training where the hypothesis class is the class of
linear separators. For simplicity we focus first on the case of example 4: the target
function is a linear separator in R^d and each example is a pair of points, both of
which are assumed to be on the same side of the separator (i.e., an example is a line
segment that does not cross the target hyperplane). We then show how our results
can be extended to the more general setting.
As in the previous example, a natural approach is to try to solve the “consistency”
problem: given a set of labeled and unlabeled data, our goal is to find a separator
that is consistent with the labeled examples and compatible with the unlabeled ones
(i.e., it gets the labeled data correct and doesn’t cut too many edges). Unfortunately,
this consistency problem is NP-hard: given a graph g embedded in R^d with two
distinguished points s and t, it is NP-hard to find the linear separator that cuts
the minimum number of edges, even if the minimum is 0 (Flaxman, 2003). For this
reason, we will make an additional assumption, that the two points in an example
are each drawn independently given the label. That is, there is a single distribution
D over R^d, and with some probability p+, two points are drawn i.i.d. from D+
(D restricted to the positive side of the target function) and with probability
1 − p+, the two are drawn i.i.d. from D− (D restricted to the negative side of the
target function). Note that our sample complexity results in section 22.3.3 extend
to weaker assumptions such as distributional expansion introduced by Balcan et al.
(2004), but we need true independence for our algorithmic results. Blum and
Mitchell (1998) have also given positive algorithmic results for co-training when
(a) the two halves of an example are drawn independently given the label (which
we are assuming now), (b) the underlying function is learnable via statistical query
algorithms⁶ (which is true for linear separators (Blum et al., 1998)), and (c) we
have enough labeled data to produce a weakly useful hypothesis (defined below)
on one of the halves to begin with. We give here an improvement over that result
by showing how we can run the algorithm in (Blum and Mitchell, 1998) with only
a single labeled example, thus obtaining an efficient algorithm in our model. It is
worth noticing that in the process, we also simplify the results of Blum et al. (1998)
6. For a detailed description of the statistical query model see (Kearns, 1998) and (Kearns
and Vazirani, 1994).
somewhat.
For the analysis below, we need the following definition. A weakly useful predictor
is a function f such that for some ǫ that is at least inverse polynomial in the input
size,
Proof sketch: Assume for convenience that the target separator passes through the
origin, and let us denote the separator by c∗ · x = 0. We will also assume for
convenience that PrD (c∗ (x) = 1) ∈ [ǫ/2, 1 − ǫ/2]; that is, the target function is not
overwhelmingly positive or overwhelmingly negative (if it is, this is actually an easy
case, but it makes the arguments more complicated). Define the margin of some
point x as the distance of x/|x| to the separating plane, or equivalently, the cosine
of the angle between c∗ and x.
We begin by drawing a large unlabeled sample S = {⟨x_1^i, x_2^i⟩}; denote by S_j
the set {x_j^i}, for j = 1, 2. (We describe our algorithm as working with the fixed
unlabeled sample S, since we just need to apply standard VC-dimension arguments
to get the desired result.) The first step is to perform a transformation T on S1
to ensure that some reasonable (1/poly) fraction of T (S1 ) has margin at least
1/poly, which we can do via the outlier removal lemma of Blum et al. (1998)
and Dunagan and Vempala (2001).⁷ The outlier removal lemma states that one
can algorithmically remove an ε′ fraction of S1 and ensure that for the remainder,
for any vector w, max_{x∈S1}(w · x)² ≤ poly(n, b, 1/ε′) E_{x∈S1}[(w · x)²], where b is the
number of bits needed to describe the input points. We reduce the dimensionality (if
necessary) to get rid of any of the vectors for which the above quantity is zero. We
then determine a linear transformation (as described in Blum et al. (1998)) so that
in the transformed space for all unit-length w, E_{x∈T(S1)}[(w · x)²] = 1. Since the
maximum is bounded, this guarantees that at least a 1/poly fraction of the points
in T (S1 ) have at least a 1/poly margin with respect to the separating hyperplane.
To avoid cumbersome notation in the rest of the discussion, we drop our use
of “T ” and simply use S and c∗ to denote the points and separator in the
transformed space. (If the distribution originally had a reasonable probability mass
at a reasonable margin from c∗ , then T could be the identity anyway.)
7. If the reader is willing to allow running time polynomial in the margin of the data set,
then this part of the argument is not needed.
In the second step, we argue that a random halfspace has at least a 1/poly chance
of being a weak predictor on S1. (Blum et al. (1998) use the Perceptron algorithm
to get weak learning; here, we need something simpler since we do not yet have any
labeled data.) Specifically, consider a point x such that the angle between x and c∗
is π/2 − γ, and imagine that we draw f at random subject to f · c∗ ≥ 0 (half of the
f ’s will have this property). Then,
Since at least a 1/poly fraction of the points in S1 have at least a 1/poly margin,
this implies that
Pr_{f,x}[f(x) = 1 | c∗(x) = 1] > Pr_{f,x}[f(x) = 1 | c∗(x) = 0] + 1/poly.
This means that a 1/poly probability mass of functions f must in fact be weakly
useful predictors.
The final step of the algorithm is as follows. Using the above observation, we pick
a random f , and plug it into the bootstrapping theorem of (Blum and Mitchell,
1998) (which, given unlabeled pairs ⟨x_1^i, x_2^i⟩ ∈ S, will use f(x_1^i) as a noisy label
of x_2^i, feeding the result into a statistical query algorithm), repeating this process
poly(n) times. With high probability, our random f was a weakly useful predictor
on at least one of these steps, and we end up with a low-error hypothesis. For
the rest of the runs of the algorithm, we have no guarantees. We now observe the
following. First of all, any function f with small err(f) must have small err_unl(f).
Second, because of the assumption of independence given the label, as shown in
theorem 22.13, the only functions with low unlabeled error rate are functions close
to c∗ , close to ¬c∗ , close to the “all-positive” function, or close to the “all-negative”
function.
So, if we simply examine all the hypotheses produced by this procedure, and pick
some h with a low unlabeled error rate that is at least ε/2-far from the “all-positive”
or “all-negative” functions, then either h or ¬h is close to c∗. We can now just draw
a single labeled example to determine which case is which.
We can easily extend our algorithm to the standard co-training setting (where c∗_1
can be different from c∗_2) as follows: we repeat the procedure in a symmetric way,
and then, in order to find a good pair of functions, just try all combinations of pairs
of functions to find one of small unlabeled error rate, not close to “all positive,” or
“all negative.” Finally we use one labeled example to produce a low-error hypothesis
(and here we use only one part of the example and only one of the functions in the
pair).
We can also talk about a transductive analogue of our (inductive) model that
incorporates many of the existing transductive methods for learning with labeled
and unlabeled data. In a transductive setting one assumes that the unlabeled sample
S is given, a random small subset is labeled, and the goal is to predict well on
the rest of S. In order to make use of unlabeled examples, we will again express
the relationship we hope the target function has with the distribution through a
compatibility notion χ. However, since in this case the compatibility between a
given hypothesis and D is completely determined by S (which is known), we will
not need to require that compatibility be an expectation over unlabeled examples.
Given this setup, from the sample-complexity point of view we only care about how
much labeled data we need, and algorithmically we need to find a highly compatible
hypothesis with low error on the labeled data.
Rather than presenting general theorems, we instead focus on the modeling
aspect and give here several examples in the context of graph-based semi-supervised
algorithms for binary classification. In these methods one usually assumes that there
is a weighted graph g defined over S, which is given a priori and encodes the prior
knowledge. In the following we denote by W the weighted adjacency matrix of g
and by CS the set of all binary functions over S.
Minimum Cut: Suppose for f ∈ C_S we define the incompatibility of f to be the
weight of the cut in g determined by f. This is the implicit notion of compatibility
considered in (Blum and Chawla, 2001), and algorithmically the goal is to find the
most compatible hypothesis that gets the labeled data correct, which can be solved
efficiently using network flow. From a sample-complexity point of view, the number
of labeled examples we need is proportional to the VC dimension of the class of
hypotheses that are at least as compatible as the target function, which is known
to be O(k/λ) (see (Kleinberg, 2000; Kleinberg et al., 2004)), where k is the number
of edges cut by c∗ and λ is the size of the global minimum cut in the graph. Also
note that the randomized min-cut algorithm (considered by Blum et al. (2004)),
which is an extension of the basic min-cut approach, can be viewed as motivated
by a PAC-Bayes sample complexity analysis of the problem.
Normalized Cut: Consider the normalized cut setting of Joachims (2003) and for
f ∈ C_S define size(f) to be the weight of the cut in g determined by f, and
let f_neg and f_pos be the number of points in S on which f predicts negative and
positive, respectively. For f ∈ C_S, define the incompatibility of f to be
size(f)/(f_neg · f_pos).
Note that this is the implicit compatibility function used in Joachims (2003), and
again, algorithmically the goal would be to find a highly compatible hypothesis
that gets the labeled data correct. Unfortunately, the corresponding optimization
problem is in this case NP-hard. Still, several approximate solutions have been
22.5 Related Models and Discussion 417
It is also interesting to consider how generative models fit into our model. As
mentioned in section 22.1, a typical assumption in a generative setting is that D is
a mixture with the probability density function p(x|θ) = p0 · p0 (x|θ0 ) + p1 · p1 (x|θ1 )
(see, for instance, (Ratsaby and Venkatesh, 1995; Castelli and Cover, 1995, 1996)).
That means that the labeled examples are generated according to the following
mechanism: a label y ∈ {0, 1} is drawn according to the distribution of classes
{p0 , p1 } and then a corresponding random feature vector is drawn according to
the class-conditional density py . The assumption typically used is that the mixture
is identifiable. Identifiability ensures that the Bayes optimal decision border {x :
p0 · p0 (x|θ0 ) = p1 · p1 (x|θ1 )} can be deduced if p(x|θ) is known, and therefore one
can construct an estimate of the Bayes border by using p(x|θ̂) instead of p(x|θ).
Essentially once the decision border is estimated, a small labeled sample suffices to
learn (with high confidence and small error) the appropriate class labels associated
with the two disjoint regions generated by the estimate of the Bayes decision border.
To see how we can incorporate this setting in our model, consider for illustration the
setting in Ratsaby and Venkatesh (1995); there they assume that p0 = p1, and that
the class-conditional densities are d-dimensional Gaussians with unit covariance and
unknown mean vectors θ_i ∈ R^d. The algorithm used is the following: the unknown
parameter vector θ = (θ0 , θ1 ) is estimated from unlabeled data using a maximum-
likelihood estimate; this determines a hypothesis which is a linear separator that
passes through the point (θ̂0 + θ̂1 )/2 and is orthogonal to the vector θ̂1 − θ̂0 ; finally
each of the two decision regions separated by the hyperplane is labeled according
to the majority of the labeled examples in the region. Given this setting, a natural
notion of compatibility we can consider is the expected log-likelihood function
8. For a more detailed discussion on this see also chapter 7 in this book.
(where the expectation is taken with respect to the unknown distribution specified
by θ). Specifically, we can identify a legal hypothesis fθ with the set of parameters
θ = (θ0, θ1) that determine it, and then we can define χ(fθ, D) = Ex∈D[log(p(x|θ))].
Ratsaby and Venkatesh (1995) show that if the unlabeled sample is large enough,
then all hypotheses specified by parameters θ which are close enough to θ will have
the property that their empirical compatibilities will be close enough to their true
compatibilities. This then implies (together with other observations about Gaussian
mixtures) that the maximum-likelihood estimate will be close enough to θ, up to
permutations. (This actually motivates χ as a good compatibility function in our
model.)
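The three-step procedure described above (estimate the means from unlabeled data, split the space with the induced hyperplane, name the two regions by majority vote) can be sketched as follows. This is only an illustration under stated assumptions: the EM routine, the function names, the synthetic means and the sample sizes are all hypothetical choices made for this sketch, not taken from Ratsaby and Venkatesh (1995).

```python
import numpy as np

def fit_two_means(X, n_iter=100, seed=0):
    """EM for an equal-weight mixture of two unit-covariance Gaussians (means only)."""
    rng = np.random.default_rng(seed)
    # Initialise the two means at two distinct unlabeled points.
    theta = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: responsibilities under unit covariance and equal priors.
        sq = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        logw = -0.5 * sq
        logw -= logw.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logw)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means.
        theta = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return theta

def side(X, theta):
    """Region index: which side of the hyperplane through the midpoint of the
    two estimated means, orthogonal to theta[1] - theta[0], each point lies on."""
    w = theta[1] - theta[0]
    mid = theta.mean(axis=0)
    return ((X - mid) @ w > 0).astype(int)

def label_regions(theta, X_lab, y_lab):
    """Assign each region the majority label of the labeled points inside it."""
    s = side(X_lab, theta)
    mapping = np.zeros(2, dtype=int)
    for r in (0, 1):
        votes = y_lab[s == r]
        # Fall back to the region index if no labeled point landed in the region.
        mapping[r] = int(votes.mean() > 0.5) if len(votes) else r
    return mapping

# Synthetic data (hypothetical): two well-separated unit-covariance Gaussians.
rng = np.random.default_rng(1)
true_means = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_unl = rng.integers(0, 2, size=2000)  # kept only to score the result
X_unl = true_means[y_unl] + rng.standard_normal((2000, 2))
y_lab = np.array([0, 1] * 5)           # ten labeled examples
X_lab = true_means[y_lab] + rng.standard_normal((10, 2))

theta_hat = fit_two_means(X_unl)       # uses unlabeled data only
mapping = label_regions(theta_hat, X_lab, y_lab)
pred = mapping[side(X_unl, theta_hat)]
accuracy = float((pred == y_unl).mean())
print(f"accuracy with 10 labels: {accuracy:.3f}")
```

The labeled examples are used only to name the two regions, which is why so few of them are needed once the unlabeled sample has fixed the hyperplane.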
More generally, if we deal with other parametric families (but we are in the
same setting), we can use the same compatibility notion; however, we will need to
impose certain constraints on the distributions allowed in order to ensure that the
compatibility is actually well defined (the expected log likelihood is bounded).
As mentioned in section 22.1 this kind of generative setting is really at the extreme
of our model. The assumption that the distribution that generates the data is really
a mixture implies that if we knew the distribution, then there are only two possible
concepts left (and this makes the unlabeled data extremely useful).
It is worth noticing that there is a strong connection between our approach and
the luckiness framework (see (Shawe-Taylor et al., 1998; Mendelson and Philips,
2003)). In both cases, the idea is to define an ordering of hypotheses that depends
on the data, in the hope that we will be “lucky” and find that not too many other
functions are as compatible as the target. There are two main differences, however.
The first is that the luckiness framework (being designed for supervised learning
only) uses labeled data both for estimating compatibility and for learning: this is a
more difficult task, and as a result our bounds on labeled data can be significantly
better. For instance, in example 4 described in section 22.2, for any nondegenerate
distribution, a data set of d/2 pairs can with probability 1 be completely shattered
by fully compatible hypotheses, so the luckiness framework does not help. In
contrast, with a larger (unlabeled) sample, one can potentially reduce the space of
compatible functions quite significantly, and learn from o(d) or even O(1) labeled
examples depending on the distribution (see sections 22.3.3 and 22.4). Secondly, the
luckiness framework talks about compatibility between a hypothesis and a sample,
whereas we define compatibility with respect to a distribution. This allows us to
talk about the amount of unlabeled data needed to estimate true compatibility.
There are also a number of differences at the technical level of the definitions.
22.5.4 Conclusions
Given the easy availability of unlabeled data in many settings, there has been
growing interest in methods that try to use such data together with the (more
expensive) labeled data for learning. Nonetheless, there is still substantial disagree-
ment and no clear consensus about when unlabeled data help and by how much. In
this chapter, we have provided a PAC-style model for semi-supervised learning that
captures many of the ways unlabeled data are typically used, and provides a very
general framework for thinking about this issue. The high-level main implication
of our analysis is that unlabeled data are useful if (a) we have a good notion of
compatibility so that the target function indeed has a low unlabeled error rate, (b)
the distribution D is helpful in the sense that not too many other hypotheses also
have a low unlabeled error rate, and (c) we have enough unlabeled data to estimate
unlabeled error rates well. One consequence of our model is that if the target func-
tion and data distribution are both well behaved with respect to the compatibility
notion, then the sample-size bounds we get for labeled data can substantially beat
what one could hope to achieve through pure labeled-data bounds, and we have
illustrated this with a number of examples throughout the chapter.
23 Metric-Based Approaches for Semi-Supervised Regression and Classification
23.1 Introduction
several hypotheses that behave similarly on the training data and yet behave
quite differently in other parts of the domain—thus diminishing the ability to
distinguish good hypotheses from bad. Since significantly different hypotheses
cannot be simultaneously accurate, one must restrict the set of hypotheses to be
able to reliably differentiate between accurate and inaccurate predictors. On the
other hand, selecting hypotheses from an overly restricted class can prevent one
from being able to express a good approximation to the ideal predictor, thereby
causing important structure in the training data to be ignored—i.e., underfitting
the training data. Since both underfitting and overfitting result in large test error,
they must be avoided simultaneously. Consequently, a popular research topic in
learning is to find automated methods for calibrating hypothesis complexity. The
work presented here exploits unlabeled data in a novel fashion to achieve this goal.
We consider two classical approaches to this problem, typically referred to as
model selection and regularization, respectively (Cherkassky and Mulier, 1998;
Vapnik, 1995, 1998). In model selection one first takes a base hypothesis class, H,
decomposes it into a discrete collection of subclasses H0 ⊂ H1 ⊂ · · · = H (say, organized
in a nested chain, or lattice) and then, given training data, attempts to identify
the optimal subclass from which to choose the final hypothesis.¹ There have been a
variety of methods proposed for choosing the optimal subclass, but most techniques
fall into one of two basic categories: complexity penalization (e.g., the minimum de-
scription length principle (Rissanen, 1986) and various statistical selection criteria
(Foster and George, 1994)); and holdout testing (e.g., cross-validation and
bootstrapping (Efron, 1979)). Regularization is similar to model selection except that
one does not impose a discrete decomposition on the base hypothesis class. Instead,
a penalty criterion is imposed on the individual hypotheses, which either penalizes
their parametric form (e.g., as in ridge regression or weight decay in neural network
training (Cherkassky and Mulier, 1998; Ripley, 1996; Bishop, 1995)) or penalizes
their global smoothness properties (e.g., minimizing curvature (Poggio and Girosi,
1990)). These methods have shown impressive improvements over naive learning
algorithms in every area of supervised learning research. However, one difficulty
with these techniques is that they usually require expertise to apply properly, and
often involve free parameters that must be set by an informed practitioner.
The contribution presented here is the derivation of parameter-free methods for
model selection and regularization that improve on the robustness of standard
approaches by using unlabeled data. As has been seen in other sections of the book,
most semi-supervised learning techniques require explicit assumptions about the
relationship between labeled and unlabeled data. For the methods presented here,
the only assumption required is that the labeled data and the unlabeled data come
from the same distribution. The methods we propose automatically differentiate
1. The term model selection has also been used to refer to other processes in machine
learning and statistics, such as choosing the kernel for support vector machines or Bayesian
model selection, but we restrict our attention to the classical form described above.
hypotheses based on the difference of their behavior off of the labeled training set
(i.e., behavior at points not covered by the training set). Like many of the semi-
supervised learning approaches proposed in this book (e.g., chapters 10 and 11), our
methods regularize in a data-specific fashion rather than simply penalizing model
complexity. This allows modern techniques to potentially outperform traditional
fixed regularizers that penalize complexity identically across different training
samples.
To begin, section 23.2 introduces the idea of metric spaces for hypotheses, allow-
ing the geometric characterization of the supervised learning problem. Section 23.3
investigates how unlabeled data can be used to perform model selection in nested
sequences of hypothesis spaces. The strategies developed are shown to experimen-
tally outperform standard model selection methods and have been proved to be
robust in theory. Section 23.4 considers regularization and shows how the proposed
model selection strategies can be extended to a generalized training objective for
supervised regression. Here the idea is to use unlabeled data to automatically tune
the degree of regularization for a given task without having to set free parameters by
hand. The resulting regularization technique adapts its behavior to a given training
set and can outperform standard fixed regularizers for a given problem. Section 23.5
extends the earlier regression approach from section 23.4 to probabilistic classifiers.
Finally, section 23.6 concludes with an examination of potential avenues for future
research.
23.2 Metric Structure of Supervised Learning
In supervised learning, one takes a sequence of training pairs ⟨x1, y1⟩, ..., ⟨xl, yl⟩
and attempts to infer a hypothesis function h : X → Y that achieves small prediction
error err(h(x), y) on future test examples. This basic paradigm covers many of the
tasks studied in machine learning research.
For model selection and regularization tasks it is necessary to be able to compare
hypothesis functions. The approach we pursue in this chapter is to exploit a concrete
notion of distance between hypothesis functions. Consider the metric structure on
a space of hypothesis functions that arises from a simple statistical model of the
supervised learning problem: Assume the examples ⟨x, y⟩ are generated by a fixed
joint distribution PXY on X × Y. In learning a hypothesis function h : X → Y the
primary interest is in modeling some aspect of the conditional distribution PY|X .
Here the utility of using extra information about the marginal domain distribution
PX to choose a good hypothesis is investigated. Note that information about PX
can be obtained from a collection of unlabeled training examples xl+1 , ..., xn . The
significance of having information about the domain distribution PX is that it
defines a natural (pseudo) metric on the space of hypotheses. That is, for any two
hypothesis functions f and g, one can obtain a measure of the distance between
them by computing the expected disagreement in their predictions,
\[ d(f, g) = \varphi\left( \int \mathrm{err}(f(x), g(x)) \, dP_X \right), \tag{23.1} \]
where err(ŷ, y) is the natural measure of prediction error for the problem at hand
(e.g., regression or classification) and ϕ is an associated normalization function that
recovers the standard metric axioms.
For the problem of regression, prediction error can be measured by squared
difference err(ŷ, y) = (ŷ − y)² or some similar loss. For classification problems,
prediction error can be measured with the misclassification loss err(ŷ, y) = 1(ŷ≠y).
The standard metric properties to be satisfied are non-negativity d(f, g) ≥ 0,
symmetry d(f, g) = d(g, f ), and the triangle inequality d(f, g) ≤ d(f, h) + d(h, g). It
turns out that most typical prediction error functions admit a metric of this type.
For example, in regression the distance between two prediction functions can be
measured by
\[ d(f, g) = \left( \int (f(x) - g(x))^2 \, dP_X \right)^{1/2}, \]
where the normalization function ϕ(z) = z^{1/2} establishes the metric properties. In
classification, the distance between two classifiers can be measured by
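\[ d(f, g) = \int 1_{(f(x) \neq g(x))} \, dP_X = P_X(f(x) \neq g(x)), \]
where ϕ is the identity. The same construction gives the distance between a
hypothesis h and the target conditional PY|X:
\[ d(P_{Y|X}, h) = \varphi\left( \iint \mathrm{err}(h(x), y) \, dP_{Y|x} \, dP_X \right). \tag{23.2} \]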
That is, one can interpret the true error of a hypothesis function h with respect to
a target conditional PY|X as a distance between h and PY|X . The significance of this
definition is that it is consistent with the previous definition (23.1) and one can
therefore embed the entire supervised learning problem in a common metric space
structure.
To illustrate: in regression, (23.2) yields the root mean squared error of a
hypothesis:
\[ d(P_{Y|X}, h) = \left( \iint (h(x) - y)^2 \, dP_{Y|x} \, dP_X \right)^{1/2}, \]
Figure 23.1 Metric space view of supervised learning: Unlabeled data can accurately
estimate distances between functions f and g within H, however only limited labeled data
are available to estimate the closest function h to PY|X .
Together, the definitions in Eqs. 23.1 and 23.2 show how to impose a global metric
space view of the supervised learning problem (figure 23.1). Given labeled training
examples ⟨x1, y1⟩, ..., ⟨xl, yl⟩, the goal is to find the hypothesis h in a space H that
is closest to a target conditional PY|X under the distance measure (Eq. 23.2). If there
is also a large set of u auxiliary unlabeled examples xl+1, ..., xn, such that u = n − l,
then one can also accurately estimate the distances between alternative hypotheses
f and g within H, effectively approximating Eq. 23.1:
\[ \tilde{d}(f, g) \triangleq \varphi\left( \frac{1}{u} \sum_{j=l+1}^{n} \mathrm{err}(f(x_j), g(x_j)) \right). \tag{23.3} \]
That is, for sufficiently large u, the distances defined in Eq. 23.3 will be very close
to the distances defined in Eq. 23.1. In fact, below we will generally assume
that u is large enough to ensure d̃(f, g) ≈ d(f, g). However, the distances between
hypotheses and the target conditional PY|X (Eq. 23.2) can only be weakly estimated
using the (presumably much smaller) set of labeled training data:
\[ \hat{d}(P_{Y|X}, h) \triangleq \varphi\left( \frac{1}{l} \sum_{i=1}^{l} \mathrm{err}(h(x_i), y_i) \right). \tag{23.4} \]
This measure need not be close to Equation 23.2. The challenge then is to ap-
proximate the closest hypothesis to the target conditional as accurately as possible
using the available information (Eqs. 23.3 and 23.4) in place of the true distances
(Eqs. 23.1 and 23.2).
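The two empirical quantities of Eqs. 23.3 and 23.4 are straightforward to compute. The following is a minimal sketch, assuming hypotheses are plain Python callables on NumPy arrays; the helper names (`dist_unlabeled`, `dist_labeled`), the example functions and the sample sizes are hypothetical choices made for illustration.

```python
import numpy as np

def sq_err(a, b):
    """Squared difference, the regression loss."""
    return (a - b) ** 2

def zero_one(a, b):
    """Misclassification loss 1(a != b)."""
    return (a != b).astype(float)

def dist_unlabeled(f, g, X_unl, err=sq_err, phi=np.sqrt):
    """Eq. 23.3: phi of the mean disagreement between f and g on unlabeled points."""
    return float(phi(np.mean(err(f(X_unl), g(X_unl)))))

def dist_labeled(h, X_lab, y_lab, err=sq_err, phi=np.sqrt):
    """Eq. 23.4: phi of the mean prediction error of h on the labeled sample."""
    return float(phi(np.mean(err(h(X_lab), y_lab))))

rng = np.random.default_rng(0)
X_unl = rng.uniform(0.0, 1.0, 200)   # u = 200 unlabeled points
X_lab = rng.uniform(0.0, 1.0, 30)    # l = 30 labeled points
y_lab = X_lab + 0.05 * rng.standard_normal(30)

# Regression: f and g differ by a constant 0.5, so their distance is 0.5.
f = lambda x: x
g = lambda x: x + 0.5
h = lambda x: x ** 2
d_fg = dist_unlabeled(f, g, X_unl)
d_fh = dist_unlabeled(f, h, X_unl)
d_hg = dist_unlabeled(h, g, X_unl)
d_hat_f = dist_labeled(f, X_lab, y_lab)

# Classification: no normalisation is needed, so phi is the identity.
f_c = lambda x: (x > 0.5).astype(int)
g_c = lambda x: (x > 0.3).astype(int)
d_c = dist_unlabeled(f_c, g_c, X_unl, err=zero_one, phi=lambda z: z)

print(d_fg, d_hat_f, d_c)
```

Because the interhypothesis estimate needs no labels, it can be made as accurate as the supply of unlabeled points allows, whereas `dist_labeled` is limited by the l labeled pairs.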
This metric space perspective will be used to devise novel model selection
and regularization strategies that exploit interhypothesis distances measured on
23.3 Model Selection
First consider the process of using model selection to choose the appropriate
level of hypothesis complexity to fit to data. This is, conceptually, the simplest
approach to automatic complexity control for supervised learning. The idea is to
stratify the hypothesis class H into a sequence (or lattice) of nested subclasses
H0 ⊂ H1 ⊂ · · · = H, and then, given training data, somehow choose a class that
has the proper complexity for the given data. To understand how one might make
this choice, note that for a given training sample ⟨x1, y1⟩, . . . , ⟨xl, yl⟩ one can,
in principle, obtain the corresponding sequence of empirically optimal functions
h0 ∈ H0, h1 ∈ H1, . . . , where
\[ h_k = \arg\min_{h \in H_k} \varphi\left( \frac{1}{l} \sum_{i=1}^{l} \mathrm{err}(h(x_i), y_i) \right) = \arg\min_{h \in H_k} \hat{d}(P_{Y|X}, h). \]
That is, here we assume an empirical risk minimization procedure is used to select
a candidate function from each class, and moreover we assume a unique minimizer
exists for each Hk.² The problem is to select one of these functions based on the
observed training errors d̂(PY|X, h0), d̂(PY|X, h1), . . . (figure 23.2). However, because
each hypothesis class subsumes those before it, these errors must monotonically
decrease (assuming one can fully optimize in each class) and therefore choosing
the function with smallest training error inevitably leads to overfitting. Some other
criterion beyond mere empirical-error minimization must be invoked to make the
final selection.
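The monotone decrease of training error along a nested chain is easy to check numerically. The sketch below is illustrative only: it assumes polynomials of increasing degree as the nested classes and a noisy step function as data, and the helper names `fit_poly` and `train_dist` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 30
x = rng.uniform(0.0, 1.0, l)
y = (x >= 0.5).astype(float) + 0.05 * rng.standard_normal(l)

def fit_poly(x, y, degree):
    """Least-squares polynomial of the given degree: the empirical minimiser in H_k."""
    V = np.vander(x, degree + 1)              # columns x^degree, ..., x^0
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)
    return coef                               # highest power first, as np.polyval expects

def train_dist(coef, x, y):
    """Empirical distance of Eq. 23.4 with squared error and square-root normalisation."""
    return float(np.sqrt(np.mean((np.polyval(coef, x) - y) ** 2)))

train_errors = [train_dist(fit_poly(x, y, k), x, y) for k in range(8)]
for k, e in enumerate(train_errors):
    print(f"degree {k}: training distance {e:.4f}")
```

Picking the degree with the smallest printed value would always pick the largest class, which is exactly the overfitting problem the text describes.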
Figure 23.2 Sequence of empirically optimal functions h0, h1, h2, h3, h4, ... induced by a
chain H0 ⊂ H1 ⊂ ... on a given training set: Dotted lines indicate decreasing optimal
training distances d̂(h0, PY|X), d̂(h1, PY|X), ... and solid lines indicate distances between
hypotheses. The final hypothesis must be selected on the basis of these estimates.
2. This uniqueness assumption is reasonable for regression problems, but generally does
not hold for classification problems under 0-1 loss; see section 23.5 below.
The first intuition explored is that interhypothesis distances can help detect over-
fitting in a very simple manner. Consider two hypotheses hk and hk+1 that both
have a small estimated distance to PY|X and yet have a large true distance between
them. In this situation there should be concern in selecting the second hypothe-
sis, because if the true distance between hk and hk+1 is indeed large, then both
functions cannot be simultaneously close to PY|X , by simple geometry. This implies
that at least one of the distance estimates to PY|X must be inaccurate. The earlier
estimate should be more trusted because it comes from a more restricted class that
is less likely to overfit. In fact, if both d̂(PY|X, hk) and d̂(PY|X, hk+1) really were
accurate estimates they would have to satisfy the triangle inequality with the known
distance d(hk, hk+1); that is,
\[ \hat{d}(P_{Y|X}, h_k) + \hat{d}(P_{Y|X}, h_{k+1}) \ge d(h_k, h_{k+1}). \tag{23.5} \]
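The test in Eq. 23.5 can be turned into a selection rule. The sketch below assumes one particular reading of the procedure, namely "choose the last hypothesis in the sequence that passes the triangle test against every one of its predecessors"; this reading, the function name `tri_select` and the toy numbers are assumptions of this illustration rather than a statement of the chapter's algorithm.

```python
def tri_select(d_hat, d_between):
    """Return the index of the last hypothesis that passes the triangle test
    against all of its predecessors (assumed selection rule).

    d_hat[k]        : training distance of h_k to the target (Eq. 23.4)
    d_between[k][j] : distance between h_k and h_j, estimated on unlabeled
                      data (Eq. 23.3)
    """
    chosen = 0  # h_0 has no predecessors, so it always passes
    for ell in range(1, len(d_hat)):
        passes = all(
            d_between[k][ell] <= d_hat[k] + d_hat[ell] for k in range(ell)
        )
        if passes:
            chosen = ell
    return chosen

# Toy example (hypothetical numbers): h_3 has a tiny training error but lies
# far from every earlier hypothesis, which signals overfitting.
d_hat = [0.5, 0.3, 0.1, 0.01]
d_between = [
    [0.0, 0.3, 0.45, 2.0],
    [0.3, 0.0, 0.2, 1.9],
    [0.45, 0.2, 0.0, 1.8],
    [2.0, 1.9, 1.8, 0.0],
]
selected = tri_select(d_hat, d_between)
print(selected)
```

In the toy example h3 would win on training error alone, but it violates the triangle test against h0, so the rule falls back to h2.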
To demonstrate this method (and all subsequent methods developed here), first
consider the problem of polynomial curve-fitting. This is a supervised learning
problem where X = R, Y = R, and the goal is to minimize the squared prediction
error, err(ŷ, y) = (ŷ − y)². Specifically, consider polynomial hypotheses h : R → R
under the natural stratification H0 ⊂ H1 ⊂ ... into polynomials of degree at
most 0, 1, and so on. The motivation for studying this task is that it is a well-studied
problem that still attracts a lot of interest (Cherkassky et al., 1997; Galarza et al.,
1996; Vapnik, 1995, 1998). Moreover, polynomials create a difficult model selection
problem that has a strong tendency to produce catastrophic overfitting effects.
Another benefit is that polynomials are an interesting and nontrivial class for which
there are efficient techniques for computing best-fit hypotheses.
To apply the metric-based approach to this task, define the metric d in terms of
the squared prediction error err(ŷ, y) = (ŷ − y)² with a square root normalization
ϕ(z) = z^{1/2}, as discussed in section 23.2. To evaluate the efficacy of TRI (the
triangle-inequality test of Eq. 23.5 used as a selection rule) on this
problem, its performance was compared to a number of standard model selection
strategies, including structural risk minimization (SRM) (Cherkassky et al., 1997;
Vapnik, 1998), risk inflation criterion (RIC) (Foster and George, 1994), Shibata’s
model selector (SMS) (Shibata, 1981), generalized cross-validation (GCV) (Craven
and Wahba, 1979), Bayesian information criterion (BIC) (Schwarz, 1978), Akaike
information criterion (AIC) (Akaike, 1974), Mallows’ Cp statistic (CP) (Mallows,
1973), and final prediction error (FPE) (Akaike, 1970). TRI was also compared to
tenfold cross-validation (CVT; a standard holdout method (Efron, 1979; Kohavi,
1995)).
A simple series of experiments was conducted by fixing a domain distribution
PX on X = R and then fixing various target functions f : R → R. The specific
target functions used in the experiments are shown in figure 23.3. To generate
training samples a sequence of values (x1, . . . , xl) was drawn, then the target
function values f(x1), . . . , f(xl) were computed and perturbed by adding independent
Gaussian noise with standard deviation σ = 0.05 to each. This resulted in a
labeled training sequence ⟨x1, y1⟩, . . . , ⟨xl, yl⟩. For a given training sample the
series of best-fit polynomials h0, h1, . . . of degree 0, 1, . . . was computed. Given this
sequence, each model selection strategy will choose some hypothesis hk on the basis
of the observed empirical errors. The implementation of TRI was given access to u
auxiliary unlabeled examples xl+1 , . . . , xn in order to estimate the true distances
between polynomials in the sequence.
The main emphasis in these experiments was to minimize the true distance
between the final hypothesis and the target conditional PY|X . That is, the primary
concern was choosing a hypothesis that obtained a small prediction error on future
test examples, independent of its complexity level. To determine the effectiveness
of the various selection strategies, the ratio of the true error (distance) of the
polynomial they selected to the best true error among polynomials in the sequence
h0 , h1 , ..., was measured. This means that the optimum achievable ratio was 1. The
Figure 23.3 Target functions used in the polynomial curve-fitting experiments (in
order): step(x ≥ 0.5), sin(1/x), sin²(2πx), and a fifth-degree polynomial.
rationale for doing this was to measure the model selection strategy’s ability to
approximate the best hypothesis in the given sequence—not find a better function
from outside the sequence.³
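The experimental protocol and the approximation ratio can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the true distance to PY|X is approximated by a large fresh test sample, the maximum degree of 10 is an arbitrary cap, and the "naive" selector (smallest training error) merely stands in for a selection strategy.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
sigma, l, max_degree = 0.05, 30, 10

def target(x):
    """Step target, step(x >= 0.5)."""
    return (x >= 0.5).astype(float)

# Labeled training sample: x ~ U(0, 1), y = f(x) + Gaussian noise.
x = rng.uniform(0.0, 1.0, l)
y = target(x) + sigma * rng.standard_normal(l)

# Large fresh sample used as a Monte Carlo stand-in for the true distance.
x_test = rng.uniform(0.0, 1.0, 20000)
y_test = target(x_test) + sigma * rng.standard_normal(20000)

def rmse(p, xs, ys):
    """Root mean squared error of polynomial p on (xs, ys)."""
    return float(np.sqrt(np.mean((p(xs) - ys) ** 2)))

# Best-fit polynomials h_0, h_1, ... of degree 0, 1, ...
polys = [Polynomial.fit(x, y, k) for k in range(max_degree + 1)]
train_err = [rmse(p, x, y) for p in polys]
true_err = [rmse(p, x_test, y_test) for p in polys]

# Approximation ratio: true error of a hypothesis over the best true error
# in the sequence, so the optimum achievable value is 1.
best = min(true_err)
ratios = [e / best for e in true_err]

naive_choice = int(np.argmin(train_err))  # smallest training error
naive_ratio = ratios[naive_choice]
print(f"naive choice: degree {naive_choice}, ratio {naive_ratio:.2f}")
```

Any selector that returns an index into `polys` can be scored the same way by reading off `ratios[index]`.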
Table 23.1 Fitting f (x) = step(x ≥ 0.5) with PX = U (0, 1) and σ = 0.05. Distribution
of approximation ratios achieved at training sample size l = 30, showing percentiles of
approximation ratios achieved in 1000 repeated trials.
Table 23.1 shows the results obtained for approximating a step function f (x) =
step(x ≥ 0.5) corrupted by Gaussian noise, where the marginal distribution PX
is uniform on [0, 1]. The strategy ADJ (adjusted-distance estimate) in the tables
is explained in section 23.3.3 below. These results were obtained by repeatedly
generating training samples of a fixed size and recording the approximation ratio
achieved by each strategy. The tables record the distribution of ratios produced by
each strategy for a training sample size of l = 30, using u = 200 unlabeled examples
to measure interhypothesis distances, repeated over 1000 trials. The initial results
appear to be quite positive. TRI achieves a median approximation ratio of 1.08.
This compares favorably to the median approximation ratio 1.54 achieved by SRM,
and 1.17 achieved by CVT. The remaining complexity penalization strategies—
GCV, FPE, etc.—all performed significantly worse on these trials. However, the
most notable difference was TRI’s robustness against overfitting. In fact, although
3. One could consider more elaborate strategies that choose hypotheses from outside the
sequence; e.g., by averaging several hypotheses together (Krogh and Vedelsby, 1995; Opitz
and Shavlik, 1996; Breiman, 1996). However, this idea will not be pursued further here.
Table 23.2 Fitting f (x) = sin(1/x) with PX = U (0, 1) and σ = 0.05. Distribution of
approximation ratios achieved at training sample size l = 30, showing percentiles of
approximation ratios achieved in 1000 repeated trials.
Table 23.3 Fitting f(x) = sin²(2πx) with PX = U(0, 1) and σ = 0.05. Distribution
of approximation ratios achieved at training sample size l = 30, showing percentiles of
approximation ratios achieved in 1000 repeated trials.
the penalization strategy SRM performed reasonably well much of the time, it was
prone to making periodic but catastrophic overfitting errors. Even the normally
well-behaved cross-validation strategy CVT made significant overfitting errors from
time to time. This is evidenced by the fact that in 1000 trials with a training
sample of size 30 (table 23.1) TRI produced a maximum approximation ratio of
2.18, whereas CVT produced a worst-case approximation ratio of 643, and the
penalization strategies SRM and GCV both produced worst-case ratios of 1.6 × 10⁷.
The 95th percentiles were TRI 1.45, CVT 6.11, SRM 419, GCV 2.7 × 10³. Similar
results for TRI are obtained for larger labeled sample sizes, such as l = 100 and
l = 200.⁴ For a broader selection of results see (Schuurmans and Southey, 2002).
The results showing TRI’s robustness against overfitting are encouraging, but
it is further possible to prove that TRI cannot produce an approximation ratio
greater than 3 due to overfitting. That is, we can bound TRI’s approximation ratio
under two simple assumptions. First, that TRI makes it to the best hypothesis hm
in the sequence. Second, that the empirical error of hm is an underestimate—that
4. Although one might suspect that the large failures could be due to measuring relative
instead of absolute error, it turns out that all of these large relative errors also correspond
to large absolute errors. This is verified in section 23.4.1 below.
Table 23.4 Fitting a fifth-degree polynomial f (x) with PX = U (0, 1) and σ = 0.05.
Distribution of approximation ratios achieved at training sample size l = 30, showing
percentiles of approximation ratios achieved in 1000 repeated trials.
Proposition 23.1 Let hm be the optimal hypothesis in the sequence h0, h1, ... (that
is, hm = arg min_{hk} d(PY|X, hk)) and let hℓ be the hypothesis selected by TRI. If (i)
m ≤ ℓ and (ii) d̂(PY|X, hm) ≤ d(PY|X, hm), then
\[ d(P_{Y|X}, h_\ell) \le 3 \, d(P_{Y|X}, h_m). \]
Note that in proposition 23.1, as well as in propositions 23.2 and 23.3 below, it
is implicitly assumed that the true interhypothesis distances d(hm , hℓ ) are known.
This, in principle, must be measured on the true marginal PX . This assumption will
be relaxed in section 23.3.4 below.
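A short derivation may help to see where the factor 3 comes from. The sketch below is not the published proof; it assumes that TRI accepts hℓ only when it passes the test of Eq. 23.5 against every predecessor (in particular against hm when m < ℓ), and that training distances do not increase along the chain. The case m = ℓ is immediate.

```latex
\begin{align*}
d(P_{Y|X}, h_\ell)
  &\le d(P_{Y|X}, h_m) + d(h_m, h_\ell)
     && \text{triangle inequality for the true metric } d \\
  &\le d(P_{Y|X}, h_m) + \hat{d}(P_{Y|X}, h_m) + \hat{d}(P_{Y|X}, h_\ell)
     && h_\ell \text{ passed the triangle test against } h_m \\
  &\le d(P_{Y|X}, h_m) + 2\, \hat{d}(P_{Y|X}, h_m)
     && \text{training distances are non-increasing, } m \le \ell \\
  &\le 3\, d(P_{Y|X}, h_m)
     && \text{assumption (ii)}
\end{align*}
```

Each step uses exactly one ingredient, which makes it clear that the bound fails only if assumption (i) or (ii) fails.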
Continuing with the experimental investigation, the basic flavor of the results
remains unchanged at different noise levels and for different domain distributions
PX . In fact, much stronger results are obtained for wider-tailed domain distributions
like Gaussian (Schuurmans and Southey, 2002) and “difficult” target functions like
sin(1/x) (table 23.2). Here the complexity penalization methods (SRM, GCV, etc.)
can be forced into a regime of constant catastrophe, CVT noticeably degrades, and
yet TRI retains performance similar to the levels shown in table 23.1.
Of course, these results might be due to considering a pathological target function
from the perspective of polynomial curve-fitting. It is therefore important to
consider other more natural targets that might be better suited to polynomial
approximation. In fact, by repeating the previous experiments with a more benign
target function, f(x) = sin²(2πx), quite different results are obtained. Table 23.3
shows that procedure TRI does not fare as well in this case—obtaining a median
approximation ratio of 3.51 (compared to 1.03 for SRM, and 1.16 for CVT). A
closer inspection of TRI’s behavior reveals that the reason for this performance
drop is that TRI systematically gets stuck at low even-degree polynomials (cf.
table 23.5). In fact, there is a simple geometric explanation for this. The even-
degree polynomials (after degree 4) all give reasonable fits to sin²(2πx) whereas
the odd-degree fits have a tail in the wrong direction. This creates a significant
distance between successive polynomials and causes the triangle inequality test
to fail between the even- and odd-degree fits, even though the larger even-degree
polynomials give a good approximation. Therefore, although the metric-based TRI
strategy is robust against overfitting, it can be prone to systematic underfitting
in seemingly benign cases. Similar results were obtained for fitting a fifth-degree
target polynomial corrupted by the same level of Gaussian noise (table 23.4). This
problem demonstrates that the first assumption used in proposition 23.1 above can
be violated in natural situations (see table 23.5). Consideration of this difficulty
leads to the development of a reformulated procedure.
Figure 23.4 The real and estimated distances between successive hypotheses hk and
hℓ and the target PY|X . Solid lines indicate real distances, dotted lines indicate empirical
distance estimates.
Assume for the sake of argument that d̃ = d (i.e., our estimate of interhypothesis
distance, based on unlabeled data, is the true distance). The final idea explored
for model selection is to observe that there would then be two metrics: the
true metric d defined by the joint distribution PXY and an empirical metric d̂
determined by the labeled training sequence ⟨x1, y1⟩, . . . , ⟨xl, yl⟩. Note that the
previous model selection strategy TRI ignored the fact that one could measure the
empirical distance between hypotheses d̂(hk, hℓ) on the labeled training data, as well
as estimate their “true” distance d(hk, hℓ) on the unlabeled data. However, the fact
that one can measure both interhypothesis distances actually gives an observable
relationship between d̂ and d in the local vicinity. This observation is now exploited
in an attempt to derive an improved model selection procedure.
Given the two metrics d and d̂, consider the triangle formed by two hypotheses
hk and hℓ and the target conditional PY|X (figure 23.4). Notice that there are six
distances involved (three real and three estimated), of which the true distances to
PY|X are the only two of importance, and yet these are the only two that are not
available. However, the observed relationship between d and d̂ can be exploited
to adjust the empirical training error estimate d̂(PY|X, hℓ). In fact, one could first
consider the simplest possible adjustment based on the naive assumption that the
observed relationship of the metrics d̂ and d between hk and hℓ also holds between
hℓ and PY|X. Note that if this were actually the case, a better estimate of d(PY|X, hℓ)
could be obtained by simply rescaling the training distance d̂(PY|X, hℓ) according to
the observed ratio d̃(hk, hℓ)/d̂(hk, hℓ). Since d̂ is expected to be an underestimate
in general, because we assume the hk are chosen by minimizing d̂, this ratio should
be larger than 1. In fact, adopting this as a simple heuristic yields another model
selection procedure, ADJ, which is also surprisingly effective (algorithm 23.2). This
simple procedure overcomes some of the underfitting problems associated with TRI
and yet retains much of TRI’s robustness against overfitting.
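The rescaling heuristic can be sketched as follows. The text above describes the adjustment for a single pair hk, hℓ; taking the largest observed ratio over all predecessors, and then choosing the hypothesis with the smallest adjusted distance, are assumptions of this sketch, as are the function name `adj_select` and the toy numbers.

```python
import numpy as np

def adj_select(d_hat, d_between_unl, d_between_lab):
    """Rescale each training distance by an observed ratio and pick the minimum.

    d_hat[l]            : training distance of h_l to the target (Eq. 23.4)
    d_between_unl[k][l] : distance between h_k and h_l on unlabeled data (Eq. 23.3)
    d_between_lab[k][l] : the same distance measured on the labeled inputs
    """
    adjusted = [d_hat[0]]  # h_0 has no predecessor, so it is left unadjusted
    for ell in range(1, len(d_hat)):
        ratios = [
            d_between_unl[k][ell] / d_between_lab[k][ell]
            for k in range(ell)
            if d_between_lab[k][ell] > 0  # skip undefined ratios
        ]
        # Assumption of this sketch: use the largest ratio over predecessors.
        factor = max(ratios) if ratios else 1.0
        adjusted.append(d_hat[ell] * factor)
    return int(np.argmin(adjusted)), adjusted

# Toy example (hypothetical numbers): h_2 looks best on training error, but
# its distances to h_0 and h_1 are badly underestimated on the labeled data.
d_hat = [0.5, 0.3, 0.1]
d_between_unl = [[0.0, 0.3, 0.9], [0.3, 0.0, 0.8], [0.9, 0.8, 0.0]]
d_between_lab = [[0.0, 0.3, 0.3], [0.3, 0.0, 0.2], [0.3, 0.2, 0.0]]
selected, adjusted = adj_select(d_hat, d_between_unl, d_between_lab)
print(selected, adjusted)
```

In the toy example the adjusted distance of h2 is inflated by a factor of 4, so the rule selects h1 instead of the hypothesis with the smallest raw training error.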
Although at first glance this procedure might seem to be ad hoc, it turns out that
one can prove an overfitting bound for ADJ that is analogous to that established for
TRI. In particular, if one assumes that ADJ makes it to the best hypothesis hm in
the sequence, and the adjusted error estimate ď(PY|X, hm) is an underestimate, then
ADJ cannot overfit by a factor much greater than 3. Again, the formal proposition
is stated, but refer to Schuurmans and Southey (2002) for a proof.
In this respect, not only does ADJ exhibit robustness against overfitting, it
also has a (weak) theoretical guarantee against underfitting. That is, with the
assumptions that the empirical distance estimates are underestimates and that the
adjusted distance estimates strictly increase the empirical distance estimates, then
if the true error of a successor hypothesis hm improves the true error of all of its
predecessors hℓ by a significant factor, hm will be selected in lieu of its predecessors.
See Schuurmans and Southey (2002) for a proof of this proposition.
Proposition 23.3 Consider a hypothesis hm, and assume that (i) d̂(PY|X, hℓ) ≤
d(PY|X, hℓ) for all 0 ≤ ℓ ≤ m, and (ii) d̂(PY|X, hℓ) ≤ ď(PY|X, hℓ) for all 0 ≤ ℓ < m.
Then if
\[ d(P_{Y|X}, h_m) < \frac{1}{3} \, \frac{\hat{d}(P_{Y|X}, h_\ell)^2}{d(P_{Y|X}, h_\ell)} \tag{23.8} \]
for all 0 ≤ ℓ < m (that is, d(PY|X, hm) is sufficiently small) it follows that
ď(PY|X, hm) < ď(PY|X, hℓ) for all 0 ≤ ℓ < m, and therefore ADJ will not choose
any predecessor of hm.
Table 23.5 Strengths of the assumptions used in propositions 23.1 and 23.2. Table shows
frequency (in percent) that the assumptions hold over 1000 repetitions of the experiments
conducted in tables 23.1, 23.2, 23.3, and 23.4 (at sample size l = 20).
Therefore, although ADJ might not have originally appeared to be well moti-
vated, it possesses worst-case bounds against overfitting and underfitting that are
different from those that have been established for conventional methods. How-
ever, these bounds remain somewhat weak. Table 23.5 shows empirical results on
the frequency with which the underlying assumptions hold on experimental data,
demonstrating that both ADJ and TRI systematically underfit in the experiments.
That is, even though assumption (ii) of proposition 23.1 is almost always satisfied
(as expected), assumption (ii) of proposition 23.2 is only true one quarter of the
time. Therefore, propositions 23.1 and 23.2 can only provide a loose characteriza-
tion of the quality of these methods. However, both metric-based procedures remain
robust against overfitting.
To demonstrate that ADJ is indeed effective, the previous experiments were
repeated with ADJ as a new competitor. The results show that ADJ robustly
outperformed the standard complexity penalization and holdout methods in all
cases considered—spanning a wide variety of target functions, noise levels, and
domain distributions PX . Tables 23.1 through 23.4 show the previous data along
with the performance characteristics of ADJ. In particular, tables 23.3, 23.4, and
23.5 show that ADJ avoids the extreme underfitting problems that hamper TRI; it
Table 23.6 Fitting f (x) = step(x ≥ 0.5) with PX = U (0, 1) and σ = 0.05 (as in table 23.1).
This table gives distribution of approximation ratios achieved with l = 30 labeled training
examples and u = 500, u = 200, u = 100, u = 50, u = 25 unlabeled examples, showing
percentiles of approximation ratios achieved after 1000 repeated trials. The experimental
setup of table 23.1 is repeated, except that a smaller number of unlabeled examples are
used.
case studies, the focus now changes to a further improvement in the basic method.
23.4 Regularization
One difficulty when doing model selection is that the generalization behavior
depends on the specific decomposition of the base hypothesis class into subclasses.
That is, different decompositions of H can lead to different outcomes. To avoid this
issue, the previous ideas need to be extended to a more general training criterion
that uses unlabeled data to decide how to penalize individual hypotheses in the
global space H. The main contribution of this section is a simple, generic training
objective that can be applied to a wide variety of supervised learning problems.
As before, assume a sizable collection of unlabeled data that can now be used to
globally penalize complex hypotheses. Specifically, an alternative training criterion
can be formulated that measures the behavior of individual hypotheses on both
the labeled and unlabeled data. The intuition behind this criterion is simple—
instead of minimizing empirical training error alone, also seek hypotheses that
behave similarly both on and off the labeled training data. This objective arises
from the observation that a hypothesis which fits the training data well but behaves
erratically off the labeled training set is not likely to generalize to unseen examples.
To detect such behavior one can measure the distances of a hypothesis from a fixed
simple “origin” function φ on both data sets. If a hypothesis is behaving erratically
off the labeled training set, then it is likely that these two distances will disagree.
This effect is demonstrated in figure 23.5 for two large-degree polynomials that
both fit the labeled training data well but differ dramatically in their true error
and their differences between distances, both on and off the training set, to the
origin function. Trivial origin functions are used throughout this section—such as
the zero function, φ = 0, or the constant function at the mean of the y labels, φ = ȳ.
In practice, these work quite well.
[Figure 23.5: plot of h, g, and φ over x ∈ [0, 1]. Values annotated in the plot: $\hat{d}(h, P_{Y|X}) = 0.004$, $d(h, P_{Y|X}) = 193.1$; $\hat{d}(g, P_{Y|X}) = 0.101$, $d(g, P_{Y|X}) = 0.543$; $\hat{d}(\phi, h) = 1.014$, $\tilde{d}(\phi, h) = 192.4$; $\hat{d}(g, \phi) = 1.010$, $\tilde{d}(g, \phi) = 0.928$.]
Figure 23.5 Two nineteenth-degree polynomials h and g that fit 20 given training
points. Here h approximately minimizes $\hat{d}(h, P_{Y|X})$, whereas g optimizes an alternative
training criterion defined in (23.10). This plot demonstrates how the labeled training
data estimate $\hat{d}(g, P_{Y|X})$ for the smoother polynomial g is much closer to its true distance
$d(g, P_{Y|X})$. However, for both functions the proximity of the estimated errors $\hat{d}(\cdot, P_{Y|X})$ to
the true errors $d(\cdot, P_{Y|X})$ appears to be reflected in the relative proximity of the estimated
distances $\hat{d}(\cdot, \phi)$ to the unlabeled distances $\tilde{d}(\cdot, \phi)$ to the simple constant origin function
φ.
[Figure 23.6: plot of g, f, and φ over x ∈ [0, 1]. Values annotated in the plot: $\hat{d}(g, P_{Y|X}) = 0.101$, $d(g, P_{Y|X}) = 0.543$; $\hat{d}(f, P_{Y|X}) = 0.098$, $d(f, P_{Y|X}) = 0.488$; $\hat{d}(g, \phi) = 1.010$, $\tilde{d}(g, \phi) = 0.928$; $\hat{d}(f, \phi) = 1.011$, $\tilde{d}(f, \phi) = 1.023$.]
Figure 23.6 A comparison of the asymmetric and symmetrized training objectives. Here
g is the nineteenth-degree polynomial which minimizes the original asymmetric criterion
(23.10) on 20 data points, whereas f minimizes the symmetrized criterion (23.12). This
plot shows how g is inappropriately drawn toward the origin φ near the right end of the
interval, whereas f behaves neutrally with respect to φ.
the penalization to a particular set of observed data. This raises the possibility
of outperforming any regularization scheme that keeps a fixed penalization level
across different training samples drawn from the same problem. In fact, such an
improvement can be achieved in realistic hypothesis classes on real data sets—as
shown in the next section.
One drawback with the minimization objective in Eq. 23.12 is that it is not convex
and therefore local minima likely exist. Typically one has to devise reasonable
initialization and restart procedures to effectively minimize such an objective. Here
we simply started the optimizer from the best-fit polynomial of each degree, or in
the case of radial basis function (RBF) regularization (below), we started from a
single initialization point. Once initialized, a standard optimization routine (Matlab
5.3 “fminunc”) was used to determine coefficients that minimized Eqs. 23.11 and
23.12. Although the nondifferentiability of Equation 23.12 creates difficulty for the
optimizer, it does not prevent reasonable results from being achieved. Therefore,
we did not find it necessary to smooth the objective with a softmax, although this
is a reasonable idea. Another potential problem could arise if h gets close to the
origin φ. However, since simple origins were chosen that were never near PY|X , h was
not drawn near φ in these experiments and thus the resultant numerical instability
did not arise.
The first supervised learning task considered is the polynomial regression problem
from section 23.3.2. The regularizer introduced above (Eq. 23.12) turns out to
perform very well in such problems. In this case, our training objective can be
where $\{x_i, y_i\}_{i=1}^{l}$ is the set of labeled training data, $\{x_j\}_{j=l+1}^{n}$ is the set of
unlabeled examples, and φ is a fixed origin function (usually set to the constant
function at the mean of the y labels). Note again that this training objective seeks
hypotheses that fit the labeled training data well while simultaneously behaving
similarly on labeled and unlabeled data.
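A minimal sketch of such a criterion follows. The exact form of Eq. 23.12 does not appear in this passage, so the code assumes a symmetrized multiplicative penalty: the training error is multiplied by the larger of the two ratios between the distance to the origin function on unlabeled points and on labeled points. The function names, the toy data, and the guard constant `eps` are illustrative assumptions.

```python
import math

def rms(values):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def ada_objective(h, phi, x_lab, y_lab, x_unlab, eps=1e-12):
    """Return (penalized objective, penalty factor) for hypothesis h."""
    d_hat_err = rms([h(x) - y for x, y in zip(x_lab, y_lab)])  # training error
    d_hat_phi = rms([h(x) - phi(x) for x in x_lab])      # distance to origin, labeled
    d_tilde_phi = rms([h(x) - phi(x) for x in x_unlab])  # distance to origin, unlabeled
    ratio = max(d_tilde_phi, eps) / max(d_hat_phi, eps)
    penalty = max(ratio, 1.0 / ratio)  # assumed symmetrized form, always >= 1
    return d_hat_err * penalty, penalty

x_lab = [0.1, 0.3, 0.5, 0.7, 0.9]
y_lab = [0.12, 0.28, 0.50, 0.73, 0.88]
x_unlab = [k / 20 for k in range(21)]
y_mean = sum(y_lab) / len(y_lab)
phi = lambda x: y_mean  # origin: constant function at the mean of the y labels

smooth = lambda x: x
wiggly = lambda x: x + 3.0 * math.sin(10.0 * math.pi * x)

obj_smooth, pen_smooth = ada_objective(smooth, phi, x_lab, y_lab, x_unlab)
obj_wiggly, pen_wiggly = ada_objective(wiggly, phi, x_lab, y_lab, x_unlab)
```

Both toy hypotheses fit the labeled points almost equally well, but the oscillating one lies much farther from the origin function on the unlabeled points than on the labeled ones, so it receives the larger penalty.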
To test the basic effectiveness of this approach, the experiments of section 23.3.2
were repeated. The first class of methods compared against were the same model
selection methods considered before: tenfold cross-validation CVT, structural risk
minimization SRM (Cherkassky et al., 1997), RIC (Foster and George, 1994),
SMS (Shibata, 1981), GCV (Craven and Wahba, 1979), BIC (Schwarz, 1978), AIC
(Akaike, 1974), CP (Mallows, 1973), FPE (Akaike, 1970), and the metric-based
model selection strategy, ADJ, introduced in section 23.3.3. However, since none of
the classical model selection methods performed competitively in these experiments,
they are not reported here (see Schuurmans and Southey (2002) for more complete
results). Instead, for comparison, results are reported for the optimal model selector,
OPT*, which makes an oracle choice of the best available hypothesis in any given
model selection sequence based on the test data. In these experiments, the model
selection methods considered polynomials of degree 0 to l − 2.⁵
Table 23.7 Fitting f (x) = step(x ≥ 0.5) with PX = U (0, 1) and σ = 0.05. Absolute
test errors (true distances) achieved. Results of 1000 repeated trials. This repeats the
conditions of table 23.1.
Method                     mean    median  stdev
ADA (23.12), φ = mean y    0.391   0.366   0.113
ADA asymmetric (23.10)     0.403   0.378   0.111
REG, λ = 1.0               0.483   0.468   0.048
REG*                       0.371   0.355   0.049
Model selection, OPT*      0.387   0.374   0.076
Model selection, ADJ       0.458   0.466   0.112
5. Note that the degree is restricted to be less than l − 1 to prevent the maximum degree
polynomials from achieving zero training error which, as discussed in section 23.3, destroys
the regularization effect of the multiplicative penalty.
Table 23.8 Fitting f (x) = sin(1/x) with PX = U (0, 1) and σ = 0.05. Absolute test errors
(true distances) achieved. Results of 1000 repeated trials. This repeats the conditions of
table 23.2.
Method                     mean    median  stdev
ADA (23.12), φ = mean y    0.444   0.425   0.085
ADA asymmetric (23.10)     0.466   0.439   0.102
REG, λ = 1.0               0.484   0.473   0.040
REG*                       0.429   0.424   0.041
Model selection, OPT*      0.433   0.427   0.049
Model selection, ADJ       0.712   0.504   0.752
Table 23.9 Fitting f (x) = sin²(2πx) with PX = U (0, 1) and σ = 0.05. Absolute test errors
(true distances) achieved. Results of 1000 repeated trials. This repeats the conditions of
table 23.3.
Method                     mean    median  stdev
ADA (23.12), φ = mean y    0.107   0.081   0.066
ADA asymmetric (23.10)     0.111   0.087   0.060
REG, λ = 5.0               0.353   0.341   0.040
REG*                       0.140   0.092   0.099
Model selection, OPT*      0.122   0.085   0.086
Model selection, ADJ       0.188   0.114   0.150
The second class of methods compared against were regularization methods that
consider polynomials of maximum degree l − 2 but penalize individual polynomials
based on the size of their coefficients or their smoothness properties. The specific
methods considered were a standard form of “ridge” penalization (or weight decay)
which places a penalty $\lambda \sum_k a_k^2$ on polynomial coefficients $a_k$ (Cherkassky
and Mulier, 1998), and Bayesian maximum a posteriori inference with zero-mean
Gaussian priors on polynomial coefficients ak with diagonal covariance matrix λI
(MacKay, 1992). Both of these methods require a regularization parameter λ to be
set by hand. These methods are referred to as REG and MAP respectively.
To test the ability of the new regularization technique to automatically set
the regularization level, a range of (fourteen) regularization parameters λ were
tried for the fixed regularization methods REG and MAP, showing the single
best value of λ obtained on the test data. For comparison purposes, the results
of the oracle regularizer, REG*, are also reported. This oracle selects the best
λ value for each training set based on examining the test data (MAP* gives
similar results here (Schuurmans and Southey, 2002)). The experiments were
conducted by repeating the conditions of section 23.3.2. Specifically, table 23.7
repeats table 23.1 (fitting a step function), table 23.8 repeats table 23.2 (fitting
Table 23.10 Fitting a fifth-degree polynomial with PX = U (0, 1) and σ = 0.05. Absolute
test errors (true distances) achieved. Results of 1000 repeated trials. This repeats the
conditions of table 23.4.
Method                     mean    median  stdev
ADA (23.12), φ = mean y    0.077   0.060   0.090
ADA asymmetric (23.10)     0.110   0.074   0.088
REG, λ = 10⁻¹              0.454   0.337   0.508
REG*                       0.147   0.082   0.121
Model selection, OPT*      0.071   0.060   0.071
Model selection, ADJ       0.116   0.062   0.188
sin(1/x)), table 23.9 repeats table 23.3 (fitting sin²(2πx)), and table 23.10 repeats
table 23.4 (fitting a fifth-degree polynomial). The regularization criterion based on
minimizing Eq. 23.12 is listed as ADA in our figures (for “adaptive” regularization).
Additionally, the asymmetric version of ADA (23.10) was tested to verify the
benefits of the symmetrized criterion (23.12).
The results are positive. The new adaptive regularization scheme ADA performed
the best among all non-oracle procedures in these experiments. Tables 23.7 through 23.10
show that it outperformed the fixed regularization strategy REG for the best fixed
choice of regularization parameter (λ), even though the optimal choice varies across
problems. This demonstrates that ADA is able to effectively tune its penalization
behavior to the problem at hand. Moreover, since it outperforms even the best
choice of λ for each data set, ADA also demonstrates the ability to adapt its
penalization behavior to a specific training set, not just a given problem. In fact,
ADA is competitive with the oracle regularizer REG* in these experiments, and
even sometimes outperformed the oracle model selection strategy OPT*. The results
also show that the asymmetric version of ADA based on (23.10) is inferior to the
symmetrized version in these experiments, confirming our prior expectations.
To test the approach on a more realistic task, the problem of regularizing radial
basis function (RBF) networks for regression was considered. RBF networks are
a natural generalization of interpolation and spline-fitting techniques. Given a set
of prototype centers c1 , ..., ck , an RBF representation of a prediction function h is
given by
$$h(x) = \sum_{i=1}^{k} w_i\, g\!\left(\frac{\|x - c_i\|}{\sigma}\right), \tag{23.13}$$
where $\|x - c_i\|$ is the Euclidean distance between x and center $c_i$ and g is a response
function with width parameter σ. In this experiment a standard local Gaussian basis
function, $g(z) = e^{-z^2/\sigma^2}$, was used.
Fitting with RBF networks is straightforward. The simplest approach is to place
a prototype center on each training example and then determine the weight vector,
w, that allows the network to fit the training labels. The best-fit weight vector can
be obtained by solving for w in
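$$\begin{bmatrix}
g\!\left(\frac{\|x_1 - x_1\|}{\sigma}\right) & \cdots & g\!\left(\frac{\|x_1 - x_l\|}{\sigma}\right) \\
\vdots & & \vdots \\
g\!\left(\frac{\|x_l - x_1\|}{\sigma}\right) & \cdots & g\!\left(\frac{\|x_l - x_l\|}{\sigma}\right)
\end{bmatrix}
\begin{bmatrix} w_1 \\ \vdots \\ w_l \end{bmatrix}
=
\begin{bmatrix} y_1 \\ \vdots \\ y_l \end{bmatrix}.$$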
The solution is guaranteed to exist and be unique for distinct training points and
most natural basis functions, including the Gaussian basis used here (Bishop, 1995).
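The fitting step can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the procedure used in the experiments: the helper names are hypothetical, the linear system is solved by plain Gaussian elimination, and the optional `lam` argument implements a ridge-penalized least-squares fit through the normal equations. The response function follows the text literally, so the width σ enters both the argument and the Gaussian.

```python
import math

def gaussian(z, sigma):
    """Response function g(z) = exp(-z^2 / sigma^2), as given in the text."""
    return math.exp(-(z * z) / (sigma * sigma))

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def fit_rbf(xs, ys, sigma, lam=0.0):
    """One center per training point; lam = 0 gives the exact interpolant."""
    n = len(xs)
    G = [[gaussian(dist(xi, xj) / sigma, sigma) for xj in xs] for xi in xs]
    if lam == 0.0:
        return solve(G, ys)
    # Ridge fit: minimize ||G w - y||^2 + lam ||w||^2 via the normal equations.
    A = [[sum(G[k][i] * G[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(G[k][i] * ys[k] for k in range(n)) for i in range(n)]
    return solve(A, b)

def predict(x, centers, w, sigma):
    """Evaluate h(x) = sum_i w_i g(||x - c_i|| / sigma)."""
    return sum(wi * gaussian(dist(x, c) / sigma, sigma) for wi, c in zip(w, centers))

xs = [(0.0,), (0.5,), (1.0,)]
ys = [1.0, 2.0, 0.5]
w_exact = fit_rbf(xs, ys, sigma=0.5)
w_ridge = fit_rbf(xs, ys, sigma=0.5, lam=1.0)
```

With `lam = 0` the network reproduces the training labels exactly, while a positive `lam` shrinks the weight vector at the cost of a nonzero training error.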
Although exactly fitting data with RBF networks is natural, it has the problem
of generally overfitting the training data in the process of replicating the y
labels. Many approaches therefore exist for regularizing RBF networks. However,
these techniques are often hard to apply because they involve setting various free
parameters or controlling complex methods for choosing prototype centers, etc.
(Cherkassky and Mulier, 1998; Bishop, 1995). The simplest regularization approach
is to add a ridge penalty to the weight vector, and minimize
$$\sum_{i=1}^{l} (h(x_i) - y_i)^2 + \lambda \sum_{i=1}^{l} w_i^2, \tag{23.14}$$
Table 23.11 RBF results showing mean test errors (distances) on the AAUP data set
(1074 instances on 12 independent attributes). Results are averaged over 100 splits of the
data set.
ADA (23.12) 0.0197 ± 0.004 | REG* 0.0329 ± 0.009
REG λ=0.0 0.1 0.25 0.5 1.0
σ= 0.0005 0.0363 0.0447 0.0482 0.0515 0.0554
0.001 0.0353 0.0435 0.0475 0.0512 0.0554
0.0025 0.0350 0.0425 0.0473 0.0514 0.0555
0.005 0.0359 0.0423 0.0475 0.0516 0.0554
0.0075 0.0368 0.0424 0.0478 0.0517 0.0553
Table 23.12 RBF results showing mean test errors (distances) on the ABALONE data
set (1000 instances on 8 independent attributes). Results are averaged over 100 splits of
the data set.
ADA (23.12) 0.034 ± 0.0046 | REG* 0.049 ± 0.0063
REG λ=0.0 0.1 0.25 0.5 1.0
σ= 4 0.4402 0.04954 0.04982 0.05008 0.05061
6 0.3765 0.04952 0.04979 0.05007 0.05063
8 0.3671 0.04951 0.04979 0.05007 0.05069
10 0.3474 0.04952 0.04979 0.05007 0.05073
12 0.3253 0.04953 0.04979 0.05008 0.05079
repositories were investigated.⁶ In the experiments, a given data set was randomly
split into training (1/10), unlabeled (7/10), and test (2/10) sets. Each of the
methods was then run on this split—this process being repeated 100 times for each
data set to obtain results. Tables 23.11 through 23.14 show that ADA regularization
was able to choose width and regularization parameters that achieved effective
generalization performance across a range of data sets. The losses for ADA and
REG* are given at the top of each table and the loss for each fixed parameter
setting is given below. The best such setting is italicized. Furthermore, all settings
that outperform ADA are shown in bold. Therefore, tables showing few bold entries
indicate that ADA is outperforming most fixed regularizers.
On these data sets, ADA performs better than any fixed regularizer on every
problem (except BODYFAT). This shows that the adaptive criterion is not only
effective at choosing good regularization parameters for a given problem but can
choose them adaptively based on the specific sample of training data given, yielding
improvements over fixed regularizers.
Table 23.13 RBF results showing mean test errors (distances) on the BODYFAT data
set (252 instances on 14 independent attributes). Results are averaged over 100 splits of
the data set.
ADA (23.12) 0.131 ± 0.0171 | REG* 0.125 ± 0.0151
REG λ=0.0 0.1 0.25 0.5 1.0
σ= 0.1 0.1658 0.1299 0.1325 0.1341 0.1354
0.5 0.1749 0.1294 0.1321 0.1337 0.1352
1 0.1792 0.1294 0.1321 0.1336 0.1353
2 0.1837 0.1296 0.1322 0.1337 0.1356
4 0.1883 0.1299 0.1323 0.1339 0.1362
Table 23.14 RBF results showing mean test errors (distances) on the BOSTON-C data
set (506 instances on 12 independent attributes). Results are averaged over 100 splits of
the data set.
ADA (23.12) 0.150 ± 0.0212 | REG* 0.151 ± 0.0197
REG λ=0.0 0.1 0.25 0.5 1.0
σ= 0.075 0.1619 0.15785 0.1614 0.1645 0.1679
0.1 0.1624 0.15779 0.1614 0.1645 0.1679
0.15 0.1633 0.15776 0.1615 0.1646 0.1680
0.2 0.1642 0.15777 0.1615 0.1647 0.1682
0.25 0.1649 0.15780 0.1616 0.1648 0.1683
23.5 Classification
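$$\hat{d}(\phi \,\|\, h) = \frac{1}{l} \sum_{i=1}^{l} \left[ \phi(x_i) \log \frac{\phi(x_i)}{h(x_i)} + (1 - \phi(x_i)) \log \frac{1 - \phi(x_i)}{1 - h(x_i)} \right], \tag{23.16}$$
$$\hat{d}(P_{Y|X} \,\|\, h) = \frac{1}{l} \sum_{i=1}^{l} \left[ -y_i \log h(x_i) - (1 - y_i) \log(1 - h(x_i)) \right]. \tag{23.17}$$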
7. Note that KL divergence is not a proper distance metric but it is frequently used in
such contexts.
8. For the sake of simplicity, only binary classification is considered.
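The two empirical quantities in (23.16) and (23.17) translate directly into code. The sketch below assumes that both the origin function φ and the hypothesis h return probabilities strictly between 0 and 1; the function names and the toy logistic hypothesis are illustrative.

```python
import math

def kl_to_origin(phi, h, xs):
    """Empirical KL divergence from phi to h, as in (23.16)."""
    total = 0.0
    for x in xs:
        p, q = phi(x), h(x)
        total += p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return total / len(xs)

def log_loss(h, xs, ys):
    """Empirical log-loss of h on binary labels ys, as in (23.17)."""
    return sum(
        -y * math.log(h(x)) - (1 - y) * math.log(1 - h(x))
        for x, y in zip(xs, ys)
    ) / len(xs)

xs = [-1.0, 0.0, 1.0]
ys = [0, 1, 1]
phi = lambda x: 0.5                             # constant origin function
h = lambda x: 1.0 / (1.0 + math.exp(-2.0 * x))  # toy logistic hypothesis

kl_self = kl_to_origin(phi, phi, xs)
kl_h = kl_to_origin(phi, h, xs)
loss_const = log_loss(phi, xs, ys)
```

A hypothesis that predicts 0.5 everywhere has zero divergence from the constant origin and a log-loss of ln 2 regardless of the labels.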
Table 23.15 Logistic regression (LR) results for six book data sets showing mean testing
error (log-loss) for ADA and regularized LR with various settings of λ.
Table 23.16 Logistic regression (LR) results for six UCI data sets showing mean testing
error for ADA and regularized LR with various settings of λ.
AUST. CRX DIAB. FLARE GERM. PIMA
ADA 0.697 0.716 0.703 0.541 0.697 0.683
λ=0 1.240 1.176 1.282 1.741 0.710 1.442
0.1 0.927 0.797 0.785 0.833 0.715 0.881
0.5 0.814 0.707 0.733 0.618 0.715 0.773
1.0 0.773 0.689 0.716 0.572 0.713 0.739
2.0 0.742 0.679 0.703 0.546 0.710 0.715
5.0 0.715 0.676 0.694 0.533 0.703 0.697
10.0 0.704 0.678 0.691 0.531 0.697 0.692
in table 23.17 for the book data sets⁹ and in table 23.18 for the UCI data. Like
the earlier regression results, the best fixed parameter setting is italicized and all
settings that outperform ADA are shown in bold.
On the book data, the results are excellent, beating the oracle regularizer on all
but Digit1 and coming very close even there. On the UCI data, the results are
more mixed but still quite positive. While the oracle is not surpassed on any data
set, ADA is still better than many fixed regularizers.
Table 23.17 Kernel logistic regression (KLR) results for six book data sets showing
mean testing error for ADA and regularized KLR with various settings of λ and σ.
Digit1 ADA 0.518 USPS ADA 0.456
σ= 0.1 0.5 1 5 10 σ= 0.1 0.5 1 5 10
λ=0 0.693 0.691 0.572 0.569 0.701 λ=0 0.693 0.693 0.691 0.478 0.480
0.1 0.693 0.692 0.636 0.690 0.723 0.1 0.693 0.693 0.692 0.444 0.477
0.5 0.693 0.693 0.667 0.716 0.725 0.5 0.693 0.693 0.693 0.481 0.498
1.0 0.693 0.693 0.677 0.718 0.724 1.0 0.693 0.693 0.693 0.503 0.504
2.0 0.693 0.693 0.684 0.717 0.721 2.0 0.693 0.693 0.693 0.531 0.511
5.0 0.693 0.693 0.689 0.712 0.715 5.0 0.693 0.693 0.693 0.578 0.526
10.0 0.693 0.693 0.691 0.706 0.709 10.0 0.693 0.693 0.693 0.615 0.549
9. We presume the similar scores achieved by so many of the fixed regularizers on the book
data are due to some regularity in those data.
Table 23.18 Kernel logistic regression (KLR) results for six UCI data sets showing
mean testing error for ADA and regularized KLR with various settings of λ and σ.
AUSTRALIAN ADA 0.685 CRX ADA 1.111
σ= 0.1 0.5 1 5 10 σ= 0.1 0.5 1 5 10
λ=0 0.851 0.772 0.748 0.708 0.710 λ=0 1.141 1.153 1.033 0.946 0.851
0.1 0.670 0.681 0.682 0.705 0.705 0.1 0.770 0.826 0.826 0.830 0.787
0.5 0.653 0.671 0.682 0.703 0.705 0.5 0.703 0.760 0.779 0.798 0.779
1.0 0.654 0.671 0.683 0.702 0.704 1.0 0.689 0.739 0.762 0.784 0.772
2.0 0.658 0.673 0.685 0.701 0.703 2.0 0.681 0.721 0.744 0.767 0.762
5.0 0.667 0.674 0.685 0.697 0.699 5.0 0.679 0.700 0.720 0.742 0.742
10.0 0.675 0.677 0.685 0.694 0.696 10.0 0.682 0.690 0.704 0.723 0.725
neural networks are striking, dramatically reducing the tendency to overfit, even
as the model complexity increases (performance on the PIMA data set with ten
hidden nodes is the only notable anomaly to be found).
Overall, these results show considerable promise for the use of ADA with prob-
abilistic classifiers, but there are clearly improvements still to be made. Adapting
the technique to work with discrete classifiers also remains a key challenge.
23.6 Conclusion
Table 23.19 Neural network (NN) results for the book data sets (except set 6) showing
mean testing error for ADA and unregularized NN with 3, 5, and 10 hidden nodes.
hidden=3 Digit1 USPS COIL0,1 BCI g241c g241d
ADA 0.756 0.579 11.282 1.162 2.120 1.108
unreg NN 84.567 51.020 22.769 154.388 122.308 160.653
hidden=5 Digit1 USPS COIL0,1 BCI g241c g241d
ADA 0.829 1.422 2.998 1.324 30.349 3.108
unreg NN 77.577 47.166 41.629 165.090 151.790 139.809
hidden=10 Digit1 USPS COIL0,1 BCI g241c g241d
ADA 1.828 9.985 2.070 0.993 4.742 1.253
unreg NN 83.693 61.913 24.233 118.572 124.658 142.555
Table 23.20 Neural network (NN) results for six UCI data sets showing mean testing
error for ADA and unregularized NN with 3, 5, and 10 hidden nodes.
hidden=3 AUST. CRX DIAB. FLARE GERM. PIMA
ADA 0.90 0.78 2.45 0.64 0.64 0.93
unreg NN 34.40 79.53 13.95 40.73 0.64 8.87
hidden=5 AUST. CRX DIAB. FLARE GERM. PIMA
ADA 1.53 1.19 1.71 0.53 0.82 0.89
unreg NN 41.13 88.43 46.47 62.41 0.73 58.43
hidden=10 AUST. CRX DIAB. FLARE GERM. PIMA
ADA 1.09 1.33 2.10 0.72 1.03 11.64
unreg NN 110.13 48.96 30.23 80.88 13.89 55.94
is “no free lunch” in general (Schaffer, 1994) and a universal improvement cannot be
claimed for every complexity-control problem (Schaffer, 1993), one should be able
to exploit additional information about the task (i.e., knowledge of PX ) to obtain
significant improvements across a wide range of problem types and conditions. The
empirical results support this view. Furthermore, ADJ remains very competitive
with newer model-selection techniques (Bengio and Chapados, 2003). Additionally,
ADJ has been independently extended along three lines (Chapelle et al., 2002): (i)
producing excellent results on time-series data, (ii) using estimated densities in lieu
of unlabeled data, and (iii) hybridizing ADJ with cross-validation.
An important direction for future research is to develop theoretical support for
these strategies—in particular, a stronger theoretical justification of the regulariza-
tion methods proposed in section 23.4, an improved analysis of the model selection
methods proposed in section 23.3, and investigation of how to apply the technique
in section 23.5 to a more general set of classifiers. It remains open as to whether
the proposed methods TRI, ADJ, and ADA are in fact the best possible ways to ex-
ploit the hypothesis distances provided by PX . A clear direction for future research
is the investigation of alternative strategies that could potentially be more effective
in this regard. For example, it remains for future work to extend the multiplicative
ADJ and ADA methods to cope with zero training errors. Additionally, more ex-
ploration of the effects of alternative origin functions (perhaps even ensembles of
origin functions) is necessary. Finally, it would be interesting to adapt the approach
to model combination methods, extending the ideas of Krogh and Vedelsby (1995)
to other combination strategies, including boosting (Freund and Schapire, 1997)
and bagging (Breiman, 1996).
Acknowledgments
Research was supported by the Alberta Ingenuity Centre for Machine Learning,
NSERC, MITACS, and the Canada Research Chair programme.
24 Transductive Inference and
Semi-Supervised Learning
This chapter discusses the difference between transductive inference and semi-
supervised learning. It argues that transductive inference captures the intrinsic
properties of the mechanism for extracting additional information from the unla-
beled data. It also shows an important role of transduction for creating noninductive
models of inference.1
Let us start with the formal problem setting for transductive inference and semi-
supervised learning.
Transductive Inference: General Setting Given a set of ℓ training pairs,
1. These remarks were inspired by the discussion, What is the Difference between Trans-
ductive Inference and Semi-Supervised Learning?, that took place during a workshop close
to Tübingen, Germany (May 24, 2005).
the one that classifies the test vectors with the smallest number of errors. Here we
consider
Therefore, in transductive inference the goal is to classify the given u test vectors
of interest while in semi-supervised learning the goal is to find the function that
minimizes the functional (24.4) (the expectation of the error).
Semi-supervised learning can be seen as being related to a particular setting of
transductive learning. Indeed, if one chooses the function to classify the given test
data (24.2) well, why not also use it to classify new unseen data? This looks like a
reasonable idea.
However from a conceptual point of view, transductive inference contains im-
portant elements of a new philosophy of inference and this is the subject of these
remarks.
The transductive mode of inference was introduced in the mid-1970s. It attempts
to estimate the values of an unknown function f (x, α0 ) at particular points of
interest. On the other hand, inductive inference attempts to estimate the unknown
function over its entire domain of definition (Vapnik, 2006). In the late 1970s the
advantage of transductive inference over inductive inference was shown on real life
problems (Vapnik and Sterin, 1977).
The problem of semi-supervised learning was introduced in the mid-1990s (cf. sec-
tion 1.1.3) and became popular in the early 2000s (Zhou et al., 2004).
24.2 Problem of Generalization in Inductive and Transductive Inference 455
The mechanism that provides the transductive mode of inference with an advantage
over the inductive mode in classification of the given points of interest has been
understood since the very first theorems of Vapnik-Chervonenkis (VC) theory were
proved.
Suppose that our goal is to find the function that minimizes the functional (24.4).
Since the probability measure in (24.4) is unknown we minimize the empirical risk
functional
$$R_{\mathrm{emp}}(\alpha) = \sum_{i=1}^{\ell} |y_i - f(x_i, \alpha)| \tag{24.5}$$
In 1968 the necessary and sufficient conditions for uniform convergence (24.6)
were discovered (Vapnik and Chervonenkis, 1968, 1971). They are based on the so-
called capacity factors. These factors will play an important role in our discussion.
We now introduce them.
Given a set of indicator functions f (x, α), α ∈ Λ and set of ℓ i.i.d. input vectors
x1 , ..., xℓ , (24.7)
consider the value $\Delta^{\Lambda}(x_1, \ldots, x_\ell)$ that defines the number of different classifications
of the set of vectors (24.7) using indicator functions from the set $f(x, \alpha)$, $\alpha \in \Lambda$.
This is the number of equivalence classes² of functions on which the set of vectors
(24.7) factorizes the set of functions $f(x, \alpha)$, $\alpha \in \Lambda$. The number of equivalence
classes has the trivial bound
Using the value $\Delta^{\Lambda}(x_1, \ldots, x_\ell)$ we define the following three capacity concepts.
2. A subset of functions that classify vectors (24.7) in the same way belong to the same
equivalence class (with respect to (24.7)).
$$\Delta^{\Lambda}_{P}(\ell) = E_{x_1, \ldots, x_\ell}\, \Delta^{\Lambda}(x_1, \ldots, x_\ell), \tag{24.9}$$
where the expectation is taken over i.i.d. data (24.7) drawn according to the
distribution P (x).
The function
$$H^{\Lambda}_{P}(\ell) = \ln \Delta^{\Lambda}_{P}(\ell) \tag{24.10}$$
$$G^{\Lambda}(\ell) = 2^{\ell} \tag{24.14}$$
holds true is equal to h. If this equality is true for any ℓ we say that the VC
The VC dimension depends only on one factor: (a) the set of functions. VC
dimension characterizes the diversity of this set of functions.
A finite VC dimension is the necessary and sufficient condition for uniform conver-
gence which is independent of the probability measure.
In 1968 we proved the important bound (Vapnik and Chervonenkis, 1968)
$$\ln G^{\Lambda}(\ell) \le h \left( \ln \frac{\ell}{h} + 1 \right). \tag{24.16}$$
This bound allows one to upper-bound the growth function with a standard function
that depends on one parameter, the VC dimension.
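A small numerical check makes the bound concrete. The sketch below uses an example that is not in the text: threshold classifiers on the real line, f(x, a) = 1 if x ≥ a and 0 otherwise, whose VC dimension is 1. It counts the distinct classifications of a sample directly and compares the logarithm of that count with the right-hand side of (24.16); the function names are illustrative.

```python
import math

def count_dichotomies(points):
    """Number of distinct labelings of points by f(x, a) = 1 if x >= a else 0."""
    pts = sorted(set(points))
    # One threshold below all points, one between each neighboring pair,
    # and one above all points realize every achievable labeling.
    thresholds = (
        [pts[0] - 1.0]
        + [(u + v) / 2 for u, v in zip(pts, pts[1:])]
        + [pts[-1] + 1.0]
    )
    labelings = {tuple(1 if x >= a else 0 for x in points) for a in thresholds}
    return len(labelings)

def vc_bound(l, h):
    """Right-hand side of (24.16): h (ln(l / h) + 1)."""
    return h * (math.log(l / h) + 1)

sample = [0.2, 0.9, 0.4, 0.7, 0.1]
n_labelings = count_dichotomies(sample)
```

For five distinct points the thresholds realize six of the 32 conceivable labelings, and ln 6 stays below the bound ln 5 + 1 obtained with h = 1.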
We have therefore obtained the following relationship:
$$H^{\Lambda}_{P}(\ell) \le \ln G^{\Lambda}(\ell) \le h \left( \ln \frac{\ell}{h} + 1 \right). \tag{24.17}$$
One can rewrite this expression in the following form: with probability 1 − η
simultaneously for all α the inequality
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{H^{\Lambda}_{P}(2\ell) - \ln \eta}{\ell}} \tag{24.19}$$
holds true. Note that this inequality depends on the distribution function P (x).
Since this inequality is true simultaneously for all functions of the admissible set,
the function that minimizes the right-hand side of (24.19) provides the guaranteed
minimum for the expected loss (24.4).
Taking into account (24.17) one can upper-bound (24.19) using the second
capacity concept, the growth function:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{\ln G^{\Lambda}(2\ell) - \ln \eta}{\ell}}. \tag{24.20}$$
This bound is true for any distribution function (i.e. for the worst distribution
function). However it is less accurate (for a specific case P (x)) than (24.19).
One can also upper-bound (24.19) and (24.20) using the third capacity concept,
the VC dimension
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h \left( \ln \frac{2\ell}{h} + 1 \right) - \ln \eta}{\ell}}. \tag{24.21}$$
The good news about this bound is that it depends on just one parameter h and
not on some integer function $G^{\Lambda}(\ell)$. However (24.21) is less accurate than (24.20)
which is less accurate than (24.19).
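To get a feel for the size of the capacity term in (24.21), it can be evaluated for a few sample sizes. The helper name and the chosen values of h, ℓ, and η below are illustrative assumptions.

```python
import math

def vc_confidence_term(l, h, eta):
    """Square-root term of (24.21) for sample size l, VC dimension h, confidence 1 - eta."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta)) / l)

term_small = vc_confidence_term(1000, 10, 0.05)
term_large = vc_confidence_term(100000, 10, 0.05)
```

With h = 10 and η = 0.05 the term is roughly 0.26 at ℓ = 1000 and shrinks as ℓ grows, which illustrates why the bound becomes informative only when the sample is large relative to the VC dimension.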
Transductive inference was inspired by the idea of finding better solutions using
the more accurate bound (24.19) instead of the bounds (24.20) and (24.21) used in
inductive inference.
Bounds (24.18) and (24.19) were obtained using the so-called symmetrization
lemma.
Lemma. The following inequality holds true:
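$$P\left\{ \sup_{\alpha} |R(\alpha) - R_{\mathrm{emp}}(\alpha)| \ge \varepsilon \right\} \le 2 P\left\{ \sup_{\alpha} \left| R^{(1)}_{\mathrm{emp}}(\alpha) - R^{(2)}_{\mathrm{emp}}(\alpha) \right| > \frac{\varepsilon}{2} \right\}, \tag{24.22}$$
where
$$R^{(1)}_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i, \alpha)| \tag{24.23}$$
and
$$R^{(2)}_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=\ell+1}^{2\ell} |y_i - f(x_i, \alpha)| \tag{24.24}$$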
are the empirical risk functionals constructed using two different samples.
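The quantity inside the right-hand probability of (24.22) is easy to compute for a concrete case. The sketch below is an illustration under assumptions not taken from the text: the function class is a finite grid of threshold classifiers on [0, 1], the labels come from a noiseless step at 0.5, and the helper names are hypothetical.

```python
import random

def emp_risk(a, sample):
    """Empirical risk of the threshold classifier 1[x >= a] on labeled pairs."""
    return sum(abs(y - (1 if x >= a else 0)) for x, y in sample) / len(sample)

def sup_gap(sample, thresholds):
    """Largest gap between the empirical risks on the two halves of the sample."""
    l = len(sample) // 2
    first, second = sample[:l], sample[l:2 * l]
    return max(abs(emp_risk(a, first) - emp_risk(a, second)) for a in thresholds)

rng = random.Random(0)
data = [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(200))]
grid = [k / 50 for k in range(51)]

gap = sup_gap(data, grid)                       # two independent halves of size 100
gap_same = sup_gap(data[:100] + data[:100], grid)  # identical halves
```

The gap is zero when both halves coincide and otherwise measures how far the two empirical risks can drift apart over the whole function class, which is exactly the event whose probability the lemma controls.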
The bound (24.18) was obtained as an upper-bound of the right-hand side of
(24.22).
Therefore, from the symmetrization lemma it follows that to obtain a bound for
inductive inference we first obtained a bound for transductive inference (for the
right-hand side of (24.22)) and then upper-bounded that.
It should be noted that since the bound (24.18) was introduced in 1968, many
efforts have been made to improve it. However, in all attempts the key element
remained the symmetrization lemma. That is, in all proofs of the bounds for
uniform convergence the first (and most difficult) step was to obtain the bound
for transductive inference. The trivial upper bound of this bound gives the desired
result.
This means that transductive inference is a fundamental step in machine learning
theory.
To get the bound (24.18) let us bound the right-hand side of (24.22). Two
24.5 Bounds for Transductive Inference 459
(b) one chooses an i.i.d. set of size 2ℓ and then randomly splits it into two subsets
of size ℓ.
2. Using model (b) one can rewrite the right-hand side of (24.22) as follows:
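$$P\left\{\sup_{\alpha} \left|R_{\mathrm{emp}}^{(1)}(\alpha) - R_{\mathrm{emp}}^{(2)}(\alpha)\right| > \frac{\varepsilon}{2}\right\} = E_{\{x_1,\ldots,x_{2\ell}\}}\, P\left\{\sup_{\alpha} \left|R_{\mathrm{emp}}^{(1)}(\alpha) - R_{\mathrm{emp}}^{(2)}(\alpha)\right| > \frac{\varepsilon}{2} \,\Big|\, \{x_1,\ldots,x_{2\ell}\}\right\}. \tag{24.25}$$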
and then take the expectation over working sets of size 2ℓ. As a result, we obtain
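$$E_{\{x_1,\ldots,x_{2\ell}\}}\, P\left\{\sup_{\alpha} \left|R_{\mathrm{emp}}^{(1)}(\alpha) - R_{\mathrm{emp}}^{(2)}(\alpha)\right| > \frac{\varepsilon}{2}\right\} \le E\,\Delta_P^\Lambda(2\ell)\exp\left\{-\varepsilon^2\ell\right\} = \exp\left\{H_P^\Lambda(2\ell) - \varepsilon^2\ell\right\}. \tag{24.27}$$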
This bound depends on the probability measure P(x) (it contains the term $H_P^\Lambda(2\ell)$).
To obtain a bound which is independent of the probability measure we upper-bound
$H_P^\Lambda(2\ell)$ by $\ln G^\Lambda(2\ell)$ (see (24.17)). Since $G^\Lambda(2\ell)$ is independent of the probability
measure we obtain the bound
$$P\left\{\sup_{\alpha} |R(\alpha) - R_{\mathrm{emp}}(\alpha)| \ge \varepsilon\right\} \le G^\Lambda(2\ell)\exp\left\{-\varepsilon^2\ell\right\} \tag{24.28}$$
The inequality (24.26) is the key element for obtaining a VC bound for transductive
inference.
Indeed, this inequality is equivalent to the following one: with probability 1 − η
holds true, where probability is defined with respect to splitting the set {x1 , ..., x2ℓ }
into two subsets:
1. one that is used in the training set x1 , ..., xℓ and
2. one that forms the test set xℓ+1 , ..., x2ℓ .
Note that this concept of probability is different from the one defined for inductive
inference and which requires the i.i.d. distribution of the elements x1 , ..., x2ℓ . The
concepts of probability will be equivalent if an element of the working set is
i.i.d. according to some unknown fixed probability distribution function. If it is not,
then all formal claims are still correct but the concept of probability is changing.
In this sense we discuss in section 24.11.1 the idea of adaptation in transductive
inference.
But even in the i.i.d. case the bound for transduction is more accurate than
(24.20) and (24.21) used in inductive inference. However, the main advantage
of transduction over induction appears when one implements the structural risk
minimization principle.
24.6 The Structural Risk Minimization Principle for Induction and Transduction
In the 1970s the structural risk minimization (SRM) principle was introduced.
Its goal was to find the function that minimizes the right-hand side of inequality
(24.19). In order to achieve this goal the following scheme was considered.
Prior to the appearance of the training set, the set of admissible functions is
organized as a structure. The nested subsets of functions (called the elements of
the structure) are specified:
inference.
The SRM principle for transductive inference can be introduced as follows
(Vapnik, 2006): Prior to splitting the given working set x1 , ..., x2ℓ into the two
subsets that define the elements of the training and test sets, one constructs the
structure on the finite number N = ∆Λ (x1 , ..., x2ℓ ) of equivalence classes F1 , ..., FN
that are the result of factorization of the given set of functions over the given 2ℓ
vectors.6
Let such a structure be
where the subset Sk∗ contains Nk equivalence classes of functions from f (x, α), α ∈
Λ.
The opportunity to construct a “smart” structure on the elements of the equiv-
alence classes is a key advantage of SRM for transductive inference over SRM for
inductive inference.
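The factorization into equivalence classes can be made concrete with a toy case. The sketch below is an illustration only: it assumes one-dimensional threshold rules f(x, a) = [x > a] and a finite grid of parameter values standing in for the set Λ; neither choice comes from the text. Rules that give the same labels on the working set are grouped together.

```python
from collections import defaultdict


def equivalence_classes(working_set, alphas):
    """Group threshold rules f(x, a) = [x > a] by their labels on the working set."""
    classes = defaultdict(list)
    for a in alphas:
        labeling = tuple(1 if x > a else 0 for x in working_set)
        classes[labeling].append(a)
    return dict(classes)


working_set = [0.1, 0.4, 0.45, 0.9]          # hypothetical working vectors
alphas = [i / 100 for i in range(-10, 111)]  # grid standing in for the parameter set
classes = equivalence_classes(working_set, alphas)

# Number of equivalence classes N and the number of grid rules in each class.
n_classes = len(classes)
class_sizes = {labeling: len(members) for labeling, members in classes.items()}
```

Although the grid contains 121 rules, only a handful of distinct labelings of the four working vectors occur, and the classes contain different numbers of rules. That difference in "size" is what a structure on the equivalence classes can exploit.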
The new development in SRM for transductive inference comes from the
consideration of the different “sizes” of the equivalence classes. The idea of creating a
smart structure on the set of equivalence classes according to their size resembles the
hierarchical Bayesian approach. In this approach one can distinguish two (or several) levels
of hierarchy: Suppose that we are given a priori information P(α) on the set of
admissible functions (before the set of vectors x1, ..., x2ℓ appears). After these vectors
appear one can calculate prior information for the equivalence classes μ(F1), ..., μ(FN)
as an integral
$$\mu(F_k) = \int_{F_k} dP(\alpha).$$
Using this prior information one can construct a “smart” structure where the first
element contains N1 equivalence classes with the largest values μ(Fi ), i = 1, ..., N ,
the second element contains N2 equivalence classes with the largest value μ(Fi ),
and so on.
Note that for transductive inference the construction of such a structure for a
given working set is an a priori process, since we use neither the split of our x
vectors into the training and test subsets nor information about the classification
of the training data.7
6. The functions that take the same values on the working set of vectors x1, ..., x2ℓ form
one equivalence class (with respect to the working set).
7. One can unify transductive and inductive inference as follows: In both cases one is
given a set of functions defined on some space. One uses the training examples from this
space to define the values of the function of interest for the whole space of definition of the
function. The difference is that in transductive inference the space of interest is discrete
(defined on the working set (24.3)) while in inductive inference it is R^d. One can conduct
a nontrivial analysis of the discrete space but not the space R^d. This defines the key factor
of the advantage of transductive inference.
For any element Sk of the structure, simultaneously for all equivalence classes
belonging to this element, with probability 1− η the following inequality holds true:
$$\frac{1}{\ell}\sum_{i=\ell+1}^{2\ell} |y_i - F_r(x_i)| \le \frac{1}{\ell}\sum_{i=1}^{\ell} |y_i - F_r(x_i)| + \sqrt{\frac{\ln N_k - \ln\eta}{\ell}}, \quad F_r \in S_k. \tag{24.32}$$
The probability is defined with respect to a random split of the set of vectors (24.3)
into two subsets: training and test vectors.8
Therefore, to minimize the number of errors on the test vectors (the left-hand
side of (24.32)) we have to choose the element of the structure Sk (it defines the
value of the second term in the right-hand side of (24.32)) and the equivalence class
belonging to this element (it defines the value of the first term in the right-hand
side of (24.32)).
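The trade-off between the two terms on the right-hand side of (24.32) can be sketched numerically. In the hypothetical example below, each element S_k of a nested structure is summarized by the pair (N_k, smallest training error among its equivalence classes); the numbers and function names are made up for illustration.

```python
import math


def transductive_bound(train_err: float, n_classes: int, ell: int, eta: float) -> float:
    """Right-hand side of (24.32) for one equivalence class in an element with N_k classes."""
    return train_err + math.sqrt((math.log(n_classes) - math.log(eta)) / ell)


def select_element(structure, ell: int, eta: float):
    """Return (bound, k) for the element of the structure with the smallest bound.

    `structure` is a list of (N_k, best training error in S_k), ordered so that N_k grows.
    """
    scored = [
        (transductive_bound(err, n_k, ell, eta), k)
        for k, (n_k, err) in enumerate(structure, start=1)
    ]
    return min(scored)


# Hypothetical structure: richer elements fit the training data better.
structure = [(10, 0.20), (1_000, 0.05), (1_000_000, 0.04)]
best_bound, best_k = select_element(structure, ell=200, eta=0.05)
```

In this example the middle element wins: the first element fits the training data too poorly, while the third pays more in ln N_k than it gains in training error.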
When constructing structures on the set of equivalence classes in discrete space one
can play combinatorial tricks. This is impossible when constructing a structure on
the set of functions defined in the whole space.
Suppose we are given a working set of size 2ℓ which forms our discrete space. Sup-
pose in this space we have N equivalence classes F1 , ..., FN of functions f (x, α), α ∈
Λ.
Consider 2ℓ new problems described by 2ℓ discrete spaces: S 1 , ...., S 2ℓ , where the
discrete space S r is defined by working vectors (24.3) from which we removed the
vector xr . For each of these spaces we can construct a set of equivalence classes
and a corresponding structure on this set. For each of these classes with probability
1 − η the inequality (24.32) holds true and therefore simultaneously for all 2ℓ + 1
problems the inequality (24.32) is true with probability 1 − (2ℓ + 1)η. Therefore
with probability 1 − η simultaneously for all 2ℓ + 1 problems the inequality
$$R_1^s(F_i^s) \le R_2^s(F_i^s) + \sqrt{\frac{\ln N_k^s - \ln\eta + \ln(2\ell + 1)}{\ell - 1}}, \quad F_r \in S_k \tag{24.33}$$
holds true, where the term ln(2ℓ + 1) is due to our combinatorial games with one
element of the working set. One can find an analogous bound for a combinatorial
game with k elements of the working set.
Combinatorial games allow one to introduce a very deep geometric concept of
equivalence classes (see (Vapnik, 2006, 1998) for details).
Figure 24.1 The large-margin hyperplane obtained using only the training set does not
belong to the largest equivalence class defined on the working set.
We have not yet discussed how to measure the size of equivalence classes. In this
section we will discuss two possibilities. We could either
1. use a measure that reflects the VC dimension concept for the set of linear (in a
feature space) indicator functions: the value of a margin for the equivalence class,
or
2. measure the size of equivalence classes using the most refined capacity concept:
the VC entropy.
Using the size of the margin for equivalence class. With the appearance
of support vector machines (SVMs) the important problem became the following:
given a working set of vectors (24.3) construct a structure on the equivalence classes
of linear functions.
Let us measure the size μ(Fi ) of an equivalence class Fi by the value of the
corresponding margin.9 Any equivalence class separates working vectors (24.3) into
two classes. Let us find among the functions belonging to the equivalence class one
that has the largest distance (margin) to the closest vector of the set (24.3). We
use this distance as the measure μ(Fi ) for the size of the equivalence class Fi . This
measure and how it differs from the SVM are illustrated in figure 24.1.
Using this concept of the size of an equivalence class, the SVM transductive
algorithms were suggested (Vapnik, 2006).
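The situation of figure 24.1 can be reproduced in a one-dimensional toy setting. The sketch below assumes threshold rules on the real line, two labeled training points at 0.0 and 1.0 with opposite labels, and two unlabeled test points; all of these values are illustrative assumptions. For threshold rules, the margin of an equivalence class is half the gap between the two adjacent working vectors that the threshold separates.

```python
import bisect


def margin_per_class(working_set):
    """Map split index i (threshold between sorted points i and i+1) to its margin."""
    xs = sorted(working_set)
    return {i: (xs[i + 1] - xs[i]) / 2 for i in range(len(xs) - 1)}


# Training points 0.0 and 1.0 (opposite labels) plus test points 0.45 and 0.6.
working_set = [0.0, 0.45, 0.6, 1.0]
margins = margin_per_class(working_set)

# Large-margin rule built from the training points alone: the midpoint threshold.
inductive_threshold = (0.0 + 1.0) / 2
inductive_class = bisect.bisect_left(sorted(working_set), inductive_threshold) - 1

# Equivalence class with the largest margin measured on the whole working set.
largest_class = max(margins, key=margins.get)
```

Here the threshold chosen from the training points alone falls into the class with the smallest margin on the working set, while the largest class places the threshold between 0.0 and 0.45.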
9. There is a direct connection between the value of the margin and the VC dimension
defined on the set of equivalence classes (see (Vapnik, 1998, chapter 8)).
called the universum. Using the working set (24.3) we will create a set of equivalence
classes of functions, and using the universum (24.34) we will evaluate the size of
the equivalence classes.
The universum plays the role of prior information in Bayesian inference. It
describes our knowledge of the problem we are solving. There exist, however,
important differences between prior information in Bayesian inference and prior
information given by the universum. In Bayesian inference, prior information is
information about the relationship of the functions in the set of admissible functions
to the desired one. The universum is information about a relationship between the
working set and a set of possible problems. For example, for the digit recognition
problem it can be some vectors whose images resemble a digit. It defines a style of
digits for the recognition task.
Figure 24.2 The largest number of contradictions on the universum defines the largest
equivalence class.
One can translate the discussions of inductive and transductive methods of inference
into the following SVM algorithms. In SVM algorithms one first maps input vectors
x into vectors z of Hilbert space Z obtaining the images of the training data and
test data:
and then constructs the optimal separating hyperplane in the feature (Hilbert)
space.
Given the images (24.35) of the training data (24.1) construct the large-margin
linear decision rule (Vapnik, 1995):
where the vector w and threshold b are the solution of the following convex quadratic
optimization problem: Minimize the functional
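$$R(w) = (w, w) + C_1\sum_{i=1}^{\ell}\theta(\xi_i), \quad C_1 \ge 0 \tag{24.37}$$
(defined by the images of the training data (24.35)) where we have denoted
$$\theta(\xi_i) = \begin{cases} 1, & \text{if } \xi_i > 0 \\ 0, & \text{if } \xi_i = 0. \end{cases}$$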
Given the images (24.35) of the training data (24.1), images (24.36) of the test data
(24.2), and the images
where the vector w and threshold b are the solution of the following convex quadratic
optimization problem: Minimize the functional
$$R(w) = (w, w) + C_1\sum_{i=1}^{\ell}\theta(\xi_i) + C_2\sum_{s=1}^{u}\theta(\xi_s^*), \quad C_1, C_2 \ge 0 \tag{24.40}$$
(defined by the images of the training data (24.35)) and the constraints
Given the images (24.35) of the training data (24.1) and the images (24.36) of
the test data (24.2) construct the large-margin linear decision rule for transductive
inference,
where the vector w and threshold b are the solution of the following optimization
problem: Minimize the functional
$$R(w) = (w, w) + C_1\sum_{i=1}^{\ell}\theta(\xi_i) + C_2\sum_{j=\ell+1}^{\ell+k}\theta(\xi_j), \quad C_1, C_2 \ge 0 \tag{24.43}$$
(defined by the images (24.35) of the training data (24.1)) and the constraints
(defined by the images (24.36) of the test data (24.2)) and its desired classifications
$y^*_{\ell+1}, \ldots, y^*_{\ell+k}$.
One more constraint. To avoid an unbalanced solution, Chapelle and Zien, following
ideas of Thorsten Joachims, suggested the following constraint (Chapelle and Zien,
2005):
$$\frac{1}{u}\sum_{j=\ell+1}^{\ell+k} ((w, z_j) + b) \approx \frac{1}{\ell}\sum_{i=1}^{\ell} y_i. \tag{24.46}$$
This constraint requires that the test data have about the same proportion of
vectors from the two classes as was observed for the training data.
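As a rough illustration of constraint (24.46), the sketch below compares the mean decision value on the test images with the mean training label. The function names and the toy data are assumptions made for illustration; the sketch averages over the number of test images supplied, and it only measures how far a given (w, b) is from satisfying the constraint rather than enforcing it during optimization.

```python
def dot(u, v):
    """Inner product (u, v) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def balance_gap(w, b, test_images, train_labels):
    """Left-hand side minus right-hand side of (24.46)."""
    mean_output = sum(dot(w, z) + b for z in test_images) / len(test_images)
    mean_label = sum(train_labels) / len(train_labels)
    return mean_output - mean_label


# Balanced toy case: labels in {-1, +1} and symmetric test images.
w, b = [1.0], 0.0
gap_balanced = balance_gap(w, b, [[1.0], [-1.0], [1.0], [-1.0]], [1, -1])

# Unbalanced case: every test image falls on the positive side.
gap_unbalanced = balance_gap(w, b, [[2.0], [2.0]], [1, -1])
```

A gap near zero means the test outputs have about the same class balance as the training labels; a large gap signals the kind of unbalanced solution the constraint is meant to rule out.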
Given the images (24.35) of the training data (24.1), the images (24.36) of the test
data (24.2), and the images (24.39) of the universum (24.34), construct the linear
decision rule
where the vector w and threshold b are the solution of the following optimization
problem: Minimize the functional
$$R(w) = (w, w) + C_1\sum_{i=1}^{\ell}\theta(\xi_i) + C_2\sum_{j=\ell+1}^{\ell+k}\theta(\xi_j) + C_3\sum_{s=1}^{u}\theta(\xi_s^*), \quad C_1, C_2, C_3 \ge 0 \tag{24.47}$$
(defined by the images of the test data (24.36)) and its desired classification, and
the constraints
To simplify the optimization problems of the described algorithms the step function
θ(ξ) was replaced by the linear function ξ in the objective functionals (24.37),
(24.40), (24.43), and (24.47). Therefore the following algorithms were obtained
(Vapnik, 1995, 1998):
Large-margin inductive SVM:
Minimize the functional
$$R(w) = (w, w) + C_1\sum_{i=1}^{\ell}\xi_i, \quad C_1 \ge 0 \tag{24.51}$$
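The effect of replacing the step function θ(ξ) by ξ can be seen on a small example. The constraints of the inductive problem are not reproduced in this excerpt, so the sketch below assumes the standard soft-margin form y_i((w, z_i) + b) ≥ 1 − ξ_i with ξ_i ≥ 0, under which the smallest feasible slack is max(0, 1 − y_i((w, z_i) + b)). All names and data are illustrative.

```python
def dot(u, v):
    """Inner product (u, v) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, images, labels):
    """Smallest slacks satisfying the assumed constraints y((w, z) + b) >= 1 - xi, xi >= 0."""
    return [max(0.0, 1.0 - y * (dot(w, z) + b)) for z, y in zip(images, labels)]


def step_objective(w, b, images, labels, c1):
    """Functional (24.37): (w, w) plus C1 times the number of positive slacks."""
    return dot(w, w) + c1 * sum(1 for xi in slacks(w, b, images, labels) if xi > 0)


def linear_objective(w, b, images, labels, c1):
    """Functional (24.51): (w, w) plus C1 times the sum of the slacks."""
    return dot(w, w) + c1 * sum(slacks(w, b, images, labels))


images = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
labels = [1, 1, -1]
w, b = [1.0, 0.0], 0.0
step_value = step_objective(w, b, images, labels, c1=10.0)
linear_value = linear_objective(w, b, images, labels, c1=10.0)
```

The step version counts margin violations and is piecewise constant in (w, b); the linear version sums the violation sizes, which is what makes the resulting problem a convex quadratic program.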
Table 24.1 Test errors of SVMs trained without and with universum.
subject to the constraints (24.44) and (24.45). One can also use hint (24.46).
Maximal contradictions on the universum transductive SVM:
Minimize the functional
$$R(w) = (w, w) + C_1\sum_{i=1}^{\ell}\xi_i + C_2\sum_{j=\ell+1}^{\ell+k}\xi_j + C_3\sum_{s=1}^{u}\xi_s^*, \quad C_1, C_2, C_3 \ge 0 \tag{24.54}$$
subject to the constraints (24.48), (24.49), (24.50). One can also use hint (24.46).
In the summer of 2005, R. Collobert and J. Weston conducted the first experiments
on training SVMs with a universum. They demonstrated that
a. SVMs plus a universum can significantly improve performance even in the
inductive mode (C2 = 0 in the functional (24.54));
b. for small training sets it is very important how the universum is constructed.
For large sets it is less important.
Using the NIST database they discriminated digit 8 from digit 5 using a conven-
tional SVM and an SVM trained with three different universum environments.
Table 24.1 shows, for different sizes of training data, the performance of
conventional SVMs and of SVMs trained using the universums U1, U2, U3. In
all cases the parameter a = 0.01, and the parameters C1, C2 and the parameter of
the Gaussian kernel were tuned using tenfold cross-validation.
For these experiments three different universums (each containing 5000 examples)
were constructed as follows:
U1 Select random digits from the other classes (0, 1, 2, 3, 4, 6, 7, 9).
U2 Create an artificial image by first selecting a random 5 and a random 8
(from a pool of 3,000 non-test examples) and then, for each pixel of the artificial
image, choosing with probability 1/2 the corresponding pixel from the image of
the 5 or from the image of the 8.
U3 Create an artificial image by first selecting a random 5 and a random 8
(from a pool of 3,000 non-test examples) and then constructing the mean of these
two digits.
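The constructions U2 and U3 are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions, not the code used in the experiments: images are represented as flat lists of pixel values, and the function names, the seed, and the tiny pools of examples are hypothetical.

```python
import random


def pixel_mix(img5, img8, rng):
    """U2: take each pixel from the 5 or from the 8 with probability 1/2."""
    return [p5 if rng.random() < 0.5 else p8 for p5, p8 in zip(img5, img8)]


def pixel_mean(img5, img8):
    """U3: the pixelwise mean of the two digits."""
    return [(p5 + p8) / 2 for p5, p8 in zip(img5, img8)]


def build_universum(fives, eights, size, mode, seed=0):
    """Build `size` artificial images from random (5, 8) pairs; mode is "U2" or "U3"."""
    if mode not in ("U2", "U3"):
        raise ValueError("mode must be 'U2' or 'U3'")
    rng = random.Random(seed)
    universum = []
    for _ in range(size):
        img5, img8 = rng.choice(fives), rng.choice(eights)
        if mode == "U2":
            universum.append(pixel_mix(img5, img8, rng))
        else:
            universum.append(pixel_mean(img5, img8))
    return universum


# Tiny stand-ins for the pools of non-test 5s and 8s (four "pixels" each).
fives = [[0.0, 0.0, 0.0, 0.0]]
eights = [[1.0, 1.0, 1.0, 1.0]]
u2 = build_universum(fives, eights, size=3, mode="U2")
u3 = build_universum(fives, eights, size=2, mode="U3")
```

Both constructions produce images that lie "between" the two classes, which is what makes them natural candidates for examples on which the decision rule should be undecided.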
There are two reasons to consider the transductive mode of inference as we have
described it above. The first reason is that it is an extremely useful tool for practical
applications (see Weston et al. (2003b) and chapter 6).
where given the training data and an unknown zip code the recognition of any
fixed digit of a zip code depends on recognition of the rest of the digits of the
zip code. That is, the rule is constructed for the specific zip code. For another zip
code one constructs another rule (which might reflect the adaptation to different
handwriting). One can find many such examples.
The second reason for considering transductive inference is that it forms the
simplest model of noninductive inference. These inferences are based on the same
general model as inductive inference: the SRM principle. The theory of transduction
describes (in the framework of the SRM principle) the mechanisms that provide the
advantage of transductive inference over inductive inference.
There also exist models of inferences that go beyond transduction. In particular,
selective inference:
Given ℓ training examples,
select among the u candidates the k vectors with the highest probability of
belonging to the first class. Examples of selective inference include:
Discovery of bioactive drugs: Given a training set (24.55) of bioactive and non-
bioactive drugs, select from the u candidates (24.56) the k representatives with the
highest probability of belonging to the bioactive group.
National security: Given training set (24.55) of terrorists and nonterrorists, select
from the u candidates (24.56) the k representatives with the highest probability of
belonging to the terrorist group.
Note that selective inference requires a less demanding solution than transductive
inference: it does not require classification of the most difficult (border) cases.
Selective inference is the basis for solving high-dimensional decision-making
problems. To analyze the selective inference problem one can use the same SRM
principle but with a different concept of equivalence classes.
A: Let me start by saying that to me, the topic of our discussion seems strange.
Rather than asking for the difference, we should ask what SSL1 and transduction
have in common, if anything. SSL is about how to use information contained in
unlabeled data which we have in addition to the labeled training set. Transduction,
on the other hand, claims that it is powerful because it is solving a simpler task
than inductive learning.
A: But couldn’t you easily build an inductive algorithm from a transductive one
by carrying out the following procedure? For all possible test inputs x: add x as a
single unlabeled point to the labeled training set, and use the transductive algorithm
to predict the corresponding output. This gives you a mapping from x to y, in other
words, a function, just like any inductive algorithm. So a transductive solution
implies an inductive one, and thus transduction is no easier than induction.
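The procedure A describes can be written as a small wrapper. In the sketch below, `transduce` stands for any transductive algorithm with the hypothetical signature (X_train, y_train, X_test) -> predictions; the nearest-neighbor rule is only a toy stand-in so that the example runs, and it does not actually exploit the test points.

```python
def induce(transduce, X_train, y_train):
    """Turn a transductive algorithm into a function x -> y using one test point at a time."""
    def decision_function(x):
        # Add x as the single unlabeled point and read off its predicted label.
        return transduce(X_train, y_train, [x])[0]
    return decision_function


def nearest_neighbor_transduce(X_train, y_train, X_test):
    """Toy stand-in for a transductive algorithm on one-dimensional inputs."""
    predictions = []
    for x in X_test:
        nearest = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        predictions.append(y_train[nearest])
    return predictions


f = induce(nearest_neighbor_transduce, X_train=[0.0, 1.0], y_train=[-1, 1])
label_near_zero = f(0.2)
label_near_one = f(0.9)
```

The wrapper makes the implicit function explicit, but each call sees only one unlabeled point, so any benefit that comes from processing many test points jointly is lost.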
B: As soon as we have more than one unlabeled point, this argument fails.
Nevertheless, in order to retain the distinction between induction and transduction,
we may want to exclude the situation. Whatever it is called, even the case of one
unlabeled point is interesting: it could be viewed as induction with a function class
which is not given explicitly.
Transduction works because the test set can give you a nontrivial factorization
of the function class. Let us call two functions equivalent if they cannot be
distinguished based on any of the given training or test examples. It is then sufficient
to use only one representative of each equivalence class, and forget about all other
functions. Our function class is effectively finite, and we can directly write down a
generalization error bound.
By the way, the size of the equivalence classes is important for generalization: I
believe that functions from large classes generalize better. Think of the notion of a
margin: if you have a large margin of separation between two classes of data, then
there are usually many different functions that fit into this margin, and correctly
separate the data (and thus are equivalent on the data).
C: This seems an interesting point. You said that one point is not enough for
transduction — how about for SSL? Would one unlabeled point be of any use?
C: But surely, the factorization of the function class which you talked about before
will also depend on P (x)?
B: Yes.
2. See chapter 1.
A Discussion of Semi-Supervised Learning and Transduction 475
A: ... and I would argue that one notion that captures whether the data are benign
is the semi-supervised smoothness assumption. This also makes the connection with
the margin, since large-margin separation is low-density separation.
C: And what happens toward the other extreme of an infinite number of points?
A: This seems to show that transduction relies on the same kind of assumptions
as SSL. And, for increasing amounts of unlabeled points, SSL also converges to
induction plus knowledge of P (x). So where is the difference? In the limit of
infinitely many unlabeled points, transduction cannot be easier than inductive SSL.
B: In the real world, we do not have infinitely many data points. Anyway, my
point of view is more fundamental. It is based on what is behind the VC bounds for
induction. To prove these bounds, one uses the symmetrization lemma: we
upper-bound the difference between the error on the training set and the expected error
by the difference between the error on the training sample and the error on a second
sample, the ghost sample. This is exact transduction; it is a statement about the error on a given set
of points. But the VC bounds for induction then have to take an expectation with
respect to the unknown points, or even a supremum over the choice of the points.
This is much worse than what one can do knowing the points.3
B: There might exist distributions for which transduction can give worse results
than induction.
A: If I try to sum up the arguments of B, there are two different reasons why
transduction can be useful. The first one is that the bounds for transduction are
tighter than the bounds for induction, and the second one is that measuring the size
of the equivalence classes is an opportunity to change the ordering in the structure
of our class of functions. This second reason seems closely related to the motivations
in SSL.
A: I surely agree that the two notions are orthogonal, but for different reasons.
To make my point clear let me consider two sets. One of them is a set of unlabeled
data which we have for training. I don’t care about the predictions on this set, I
only care about how to use the information this set provides about P (x). So I need
to assume that this set actually comes from P (x), or at least from a distribution
that is related to P (x) in some way. The other set is the actual test set. I do not care
where it comes from; it could be anything. In my view, a transductive algorithm
is one whose solution depends on the test points that I am given. The opposite of
a transductive algorithm is an inductive one. A semi-supervised algorithm, on the
other hand, is one that depends on the unlabeled set (as opposed to a supervised
algorithm). It does not care which test points are used in the end to evaluate its
performance.
B: This does not make sense to me. The test points need to be meaningful.
Transduction is intrinsically simpler than induction: it does not make predictions
for arbitrary test points.
B: Indeed, the philosophy is similar, since in both cases one solves a simpler
problem. However, local learning is still inductive because there exists an implicit
decision function, even though it is never explicitly constructed. The concept of
local learning is actually almost the same as transduction with one test point which
we were talking about earlier.
C: This local learning idea might also be present in TSVM. Indeed, I can see an
advantage in using as unlabeled points the test points rather than an arbitrary set
of unlabeled points: by doing so, the algorithm concentrates in the regions of the
space where it is important to be accurate, as in local learning.5
A: The way I view them, transductive algorithms can also be designed for
computational reasons. Take, for instance, the Bayesian committee machine.6 The
solution returned by this algorithm is an expansion on a set of basis functions.
But for computational efficiency, only basis functions centered at the test points
are considered. So the solution will depend on the test set and the algorithm is
transductive according to my definition...
C: ... but not according to the definition of B, since for this algorithm the test
points can be arbitrary.
A: And this shows why transductive methods are always semi-supervised: they
use information contained in the test points. Otherwise there would be no reason
not to consider arbitrary test points.
C: This is an interesting example. But it seems different from the standard i.i.d.
framework: in this case, if viewed as drawn from the distribution of all possible
digits, the test points are dependent, because they have been written by the same
person.
A: I do not think we have resolved the question we were asking. Read chapter 25
of Chapelle et al. (2006) and the references therein, and you will understand what
I mean.
References
B. Abboud, F. Davoine, and M. Dang. Expressive face recognition and synthesis. In Computer
Vision and Pattern Recognition Workshop, volume 5, page 54, 2003.
N. Abe, J. Takeuchi, and M. Warmuth. Polynomial learnability of stochastic rules with respect to
the KL-divergence and quadratic distance. IEICE Transactions on Information and Systems,
E84-D(3):299–316, March 2001.
Y. S. Abu-Mostafa. Machines that learn from hints. Scientific American, 272(4):64–69, 1995.
A. K. Agrawala. Learning with a probabilistic teacher. IEEE Transactions on Information Theory,
16:373–379, 1970.
A. Agresti. Categorical Data Analysis. Wiley, Hoboken, NJ, 2002.
H. Akaike. Statistical predictor information. Annals of the Institute of Statistical Mathematics,
22:203–271, 1970.
H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19:716–723, 1974.
B. Alberts, D. Bray, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. Essential Cell
Biology: An Introduction to the Molecular Biology of the Cell. New York, Garland Science
Publishing, 1998.
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment
search tool. Journal of Molecular Biology, 215:403–410, 1990.
S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman.
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.
Nucleic Acids Research, 25:3389–3402, 1997.
Y. Altun, D. McAllester, and M. Belkin. Maximum margin semi-supervised learning for structured
variables. In Advances in Neural Information Processing Systems, volume 18, 2005.
M. R. Amini and P. Gallinari. Semi-supervised logistic regression. In Fifteenth European
Conference on Artificial Intelligence, pages 390–394, 2002.
J. A. Anderson. Multivariate logistic compounds. Biometrika, 66:17–26, 1979.
M. Anjos. New Convex Relaxations for the Maximum Cut and VLSI Layout Problems. PhD
thesis, Waterloo University, Waterloo, Canada, 2001.
F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO
algorithm. In Proceedings of the Twenty-first International Conference on Machine Learning,
New York, 2004. ACM Press.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York,
1999.
M.-F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. In
Conference on Computational Learning Theory, pages 111–126, 2005.
M.-F. Balcan, A. Blum, and K. Yang. Co-training and expansion: Towards bridging theory and
practice. In Advances in Neural Information Processing Systems, 2004.
S. Baluja. Probabilistic modeling for face orientation discrimination: Learning from labeled and
unlabeled examples. In Advances in Neural Information Processing Systems 11, pages 854–860,
1999.
A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von
Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005a.
A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal
of Machine Learning Research, 6:1705–1749, Oct 2005b.
N. Bansal, A. L. Blum, and S. Chawla. Correlation clustering. In The 43rd Annual IEEE
Symposium on Foundations of Computer Science, pages 238–247, 2002.
A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equiv-
alence relations. In Proceedings of the International Conference on Machine Learning, pages
11–18, Washington, DC, 2003.
P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural
results. Journal of Machine Learning Research, 3:463–482, 2002.
S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In Proceedings
of the International Conference on Machine Learning, pages 19–26, 2002.
S. Basu, A. Banerjee, and R. J. Mooney. Active semi-supervision for pairwise constrained
clustering. In Proceedings of the SIAM International Conference on Data Mining, 2004a.
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering.
In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery
and data mining, pages 59–68, Seattle, WA, 2004b.
E. B. Baum. Polynomial time algorithms for learning neural nets. In Proceedings of the Third
Annual Workshop on Computational Learning Theory, pages 258–272, 1990.
S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot
stereograms. Nature, 355:161–163, 1992.
M. Belkin. Problems of Learning on Manifolds. PhD thesis, Department of Mathematics,
University of Chicago, 2003.
M. Belkin, I. Matveeva, and P. Niyogi. Regression and regularization on large graphs. In
Proceedings of the Seventeenth Annual Conference on Learning Theory, 2004a.
M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large
graphs. In Proceedings of the Seventeenth Annual Conference on Computational Learning
Theory, pages 624–638, Banff, Canada, 2004b.
M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. In Advances in Neural
Information Processing Systems, 2002.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representa-
tion. Neural Computation, 15(6):1373–1396, 2003a.
M. Belkin and P. Niyogi. Using manifold structure for partially labeled classification. In S. Becker,
S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15,
Cambridge, MA, 2003b. MIT Press.
M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for
learning from examples. Technical Report TR-2004-06, University of Chicago, 2004c.
M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. In R. G. Cowell and
Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intel-
ligence and Statistics, pages 17–24. Society for Artificial Intelligence and Statistics, 2005.
R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton,
NJ, 1961.
S. Ben-David, A. Itai, and E. Kushilevitz. Learning by distances. Information and Computation,
117(2):240–250, 1995.
A. Ben-Hur and D. Brutlag. Remote homology detection: A motif based approach. In Proceedings
of the Seventh International Conference on Intelligent Systems for Molecular Biology, 2003.
G. M. Benedek and A. Itai. Learnability with respect to a fixed distribution. Theoretical Computer
Science, 86:377–389, 1991.
Y. Bengio and N. Chapados. Extensions to metric based model selection. Journal of Machine
Learning Research, 3:1209–1227, 2003.
Y. Bengio, O. Delalleau, and N. Le Roux. The curse of dimensionality for local kernel machines.
Technical Report 1258, Département d’informatique et recherche opérationnelle, Université de
Montréal, 2005.
Y. Bengio, O. Delalleau, and N. Le Roux. The curse of highly variable functions for local kernel
machines. In Advances in Neural Information Processing Systems 18. MIT Press, Cambridge,
MA, 2006a.
Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning
eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197–2219,
2004a.
Y. Bengio, H. Larochelle, and P. Vincent. Non-local manifold Parzen windows. In Advances in
Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 2006b.
Y. Bengio and M. Monperrus. Non-local manifold tangent learning. In L.K. Saul, Y. Weiss, and
L. Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA,
2005. MIT Press.
Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample
extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In S. Thrun, L. Saul,
and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press,
Cambridge, MA, 2004b.
K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In M. S. Kearns, S. A.
Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11,
pages 368–374, Cambridge, MA, 1999. MIT Press.
K. P. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in ensemble methods. In
Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and
data mining, 2002.
J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, 2nd
edition, 1985.
R. H. Berk. Limiting behavior of posterior distributions when the model is incorrect. Annals of
Mathematical Statistics, pages 51–58, 1966.
M. Bernstein, V. de Silva, J. C. Langford, and J. B. Tenenbaum. Graph approximations to
geodesics on embedded manifolds. Technical report, Stanford University, Stanford, December
2000.
J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society,
Series B (Methodological), 48(3):259–302, 1986.
A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. Submitted for
publication, 2004.
M. Bilenko and S. Basu. A comparison of inference techniques for semi-supervised clustering
with hidden Markov random fields. In Proceedings of the ICML-2004 Workshop on Statistical
Relational Learning and its Connections to Other Fields, Banff, Canada, 2004.
M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-
supervised clustering. In Proceedings of the International Conference on Machine Learning,
pages 81–88, Banff, Canada, 2004.
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity
measures. In Proceedings of the ninth ACM SIGKDD international conference on knowledge
discovery and data mining, pages 39–48, Washington, DC, 2003.
C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
R. E. Blahut. Computation of channel capacity and rate distortion functions. In IEEE Transac-
tions on Information Theory, volume 18, pages 460–473, July 1972.
A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In
Proceedings of the Eighteenth International Conference on Machine Learning, pages 19–26,
2001.
A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy
linear threshold functions. Algorithmica, 22:35–52, 1998.
A. Blum and R. Kannan. Learning an intersection of k halfspaces over a uniform distribution.
Journal of Computer and System Sciences, 54(2):371–380, 1997.
A. Blum, J. Lafferty, M. Rwebangira, and R. Reddy. Semi-supervised learning using randomized
mincuts. In International Conference on Machine Learning, 2004.
A. Blum and J. C. Langford. PAC-MDL bounds. In Conference on Computational Learning
Theory, 2003.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings
of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik
Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
R. Board and L. Pitt. Semi-supervised learning. Machine Learning, 4(1):41–65, 1989.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers.
In D. Haussler, editor, Proceedings of the Fifth Annual ACM Workshop on Computational
Learning Theory, pages 144–152, 1992.
L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent
advances. ESAIM: Probability and Statistics, 9:323–375, November 2005.
S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications.
Random Structures and Algorithms, 16:277–292, 2000.
O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. In Advances in Neural
Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In Advances in
Neural Information Processing Systems, volume 14, 2002.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge
UK, 2004.
M. Brand. Structure learning in conditional probability models via an entropic prior and parameter
extinction. Neural Computation, 11(5):1155–1182, 1999.
M. Brand. Nonlinear dimensionality reduction by kernel eigenmaps. In International Joint
Conference on Artificial Intelligence, 2003.
M. Brand. From subspaces to submanifolds. In Proceedings of the British Machine Vision
Conference, London, 2004.
L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
S. E. Brenner, P. Koehl, and M. Levitt. The ASTRAL compendium for sequence and structure
analysis. Nucleic Acids Research, 28:254–256, 2000.
R. Bruce. Semi-supervised learning using prior probabilities and EM. In IJCAI-01 Workshop on
Text Learning: Beyond Supervision, August 2001.
W. L. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence
Research, 2:159–225, 1994.
C. J. C. Burges. Geometric methods for feature extraction and dimensional reduction. In L. Rokach
and O. Maimon, editors, Data Mining and Knowledge Discovery Handbook: A Complete Guide
for Practitioners and Researchers. Kluwer, Dordrecht, the Netherlands, 2005.
V. Castelli. The Relative Value of Labeled and Unlabeled Samples in Pattern Recognition. PhD
thesis, Stanford University, Stanford, CA, December 1994.
V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition
Letters, 16:105–111, 1995.
V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern
recognition with an unknown mixing parameter. IEEE Transactions on Information Theory,
42(6):2102–2117, November 1996.
G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions.
Computational Statistics & Data Analysis, 14(3):315–332, 1992.
O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. The MIT Press, 2006.
O. Chapelle and V. Vapnik. Model selection for support vector machines. In Advances in Neural
Information Processing Systems, volume 12, 2000.
O. Chapelle, V. Vapnik, and Y. Bengio. Model selection for small sample regression. Machine
Learning, 48(1-3):9–23, 2002.
O. Chapelle, V. Vapnik, and J. Weston. Transductive inference for estimating values of functions.
In Advances in Neural Information Processing Systems, 1999.
O. Chapelle, J. Weston, L. Bottou, and V. Vapnik. Vicinal risk minimization. In T. K. Leen,
T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems
13, pages 416–422, Cambridge, MA, 2001. MIT Press.
O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In
S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing
Systems 15, pages 585–592, Cambridge, MA, 2003. MIT Press.
O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Tenth
International Workshop on Artificial Intelligence and Statistics, pages 57–64, 2005.
J. Cheng, D. Bell, and W. Liu. Learning belief networks from data: An information theory based
approach. In International Conference on Information and Knowledge Management, pages
325–331, 1997.
V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods. Wiley, New
York, 1998.
V. Cherkassky, F. Mulier, and V. Vapnik. Comparison of VC-method with classical methods for
model selection. In Proceedings World Congress on Neural Networks, pages 957–962, 1997.
F. R. K. Chung. Spectral Graph Theory. Number 92 in Regional Conference Series in Mathematics.
American Mathematical Society, Providence, RI, 1997.
I. Cohen, F. Cozman, N. Sebe, M. C. Cirelo, and T. Huang. Semisupervised learning of classifiers:
Theory, algorithms, and their application to human-computer interaction. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 26(12):1553–1568, 2004.
I. Cohen, N. Sebe, F. G. Cozman, M. C. Cirelo, and T. S. Huang. Learning Bayesian network
classifiers for facial expression recognition using both labeled and unlabeled data. In IEEE
Conference on Computer Vision and Pattern Recognition, 2003.
D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical
Report TR2003-1892, Cornell University, Ithaca, NY, 2003.
R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker.
Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion
maps. Proceedings of the National Academy of Sciences, 102:7426–7431, 2005.
M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings
of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and
Very Large Corpora, pages 189–196, 1999.
R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large-scale transductive SVMs. Journal of
Machine Learning Research, 2006. In press. http://www.kyb.tuebingen.mpg.de/bs/people/
fabee/transduction.html.
D. B. Cooper and J. H. Freeman. On the asymptotic improvement in the outcome of supervised
learning provided by additional nonsupervised learning. IEEE Transactions on Computers,
C-19(11):1055–1063, November 1970.
A. Corduneanu and T. Jaakkola. Continuation methods for mixing heterogeneous sources. In
Proceedings of the Eighteenth Annual Conference on Uncertainty in Artificial Intelligence,
2002.
A. Corduneanu and T. Jaakkola. On information regularization. In Proceedings of the Nineteenth
Conference on Uncertainty in Artificial Intelligence, 2003.
A. Corduneanu and T. Jaakkola. Distributed information regularization on graphs. In Advances
in Neural Information Processing Systems 17, 2004.
C. Cortes, P. Haffner, and M. Mohri. Rational kernels. In Advances in Neural Information Processing Systems
15, 2002.
C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20:273–297,
1995.
T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 1991.
T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.
F. G. Cozman and I. Cohen. Unlabeled data can degrade classification performance of generative
classifiers. In Proceedings of the Fifteenth International Florida Artificial Intelligence Research
Society Conference, pages 327–331, Pensacola, FL, 2002.
F. G. Cozman, I. Cohen, and M. C. Cirelo. Semi-supervised learning and model search. In
Proceedings of the ICML-2003 Workshop: The Continuum from Labeled to Unlabeled Data in
Machine Learning and Data Mining, pages 111–112, 2003a.
F. G. Cozman, I. Cohen, and M. C. Cirelo. Semi-supervised learning of mixture models. In
International Conference on Machine Learning, pages 99–106, 2003b.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery.
Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118
(1–2):69–113, 2000.
P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik,
31:377–403, 1979.
P. Derbeko, R. El-Yaniv, and R. Meir. Error bounds for transductive learning via compression
and clustering. In Advances in Neural Information Processing Systems, pages 1085–1092. MIT
Press, Cambridge, MA, 2003.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31
of Applications of Mathematics. Springer-Verlag, New York, 1996.
I. S. Dhillon and Y. Guan. Information theoretic clustering of sparse co-occurrence data. In Third
IEEE International Conference on Data Mining, pages 517–521, 2003.
I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering.
Machine Learning, 42:143–175, 2001.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning
algorithms. Neural Computation, 10(7):1895–1924, 1998.
B. E. Dom. An information-theoretic external cluster-validity measure. Research Report RJ
10219, IBM, 2001.
P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one
loss. Machine Learning, 29(2/3):103–130, 1997.
D. L. Donoho and C. E. Grimes. When does Isomap recover the natural parameterization of
families of articulated images? Technical Report 2002-27, Department of Statistics, Stanford
University, Stanford, CA, August 2002.
D. L. Donoho and C. E. Grimes. Hessian eigenmaps: Locally linear embedding techniques for
high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596,
2003.
P. G. Doyle and J. L. Snell. Random walks and electric networks. Mathematical Association of
America, 1984.
H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization. IEEE
Transactions on Neural Networks, 10(5):1048–1054, 1999.
S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and
representations for text categorization. In Proceedings of the ACM International Conference
on Information and Knowledge Management, pages 148–155, 1998.
J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. In Proceedings
of the Thirty-third ACM Symposium on Theory of Computing, 2001.
B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal
of the American Statistical Association, 70(352):892–898, 1975.
B. Efron. Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21:
460–480, 1979.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number
of examples needed for learning. Information and Computation, 82:246–261, 1989.
B. Fischer, V. Roth, and J. M. Buhmann. Clustering with the connectivity kernel. In Advances
in Neural Information Processing Systems 16, 2004.
A. Flaxman, 2003. Personal communication.
D. Foster and E. George. The risk inflation criterion for multiple regression. Annals of Statistics,
22:1947–1975, 1994.
S. C. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information
Theory, 13:57–64, 1967.
Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and
Knowledge Discovery, 1(1):55–77, 1997.
J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in
logarithmic expected time. ACM Transactions on Mathematical Software, 3:209–226, 1977.
J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of
boosting. Annals of Statistics, 28(2):337–407, 2000.
N. Friedman. The Bayesian structural EM algorithm. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, pages 129–138, 1998.
N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:
131–163, 1997.
G. Fung and O. Mangasarian. Semi-supervised support vector machines for unlabeled data
classification. Optimization Methods and Software, 15:29–44, 2001.
C. Galarza, E. Rietman, and V. Vapnik. Applications of model selection techniques to polynomial
approximation. Preprint, 1996.
A. Gammerman, V. Vapnik, and V. Vovk. Learning by transduction. In Conference on
Uncertainty in Artificial Intelligence, pages 148–156, 1998.
S. Ganesalingam. Classification and mixture approaches to clustering via maximum likelihood.
Applied Statistics, 38(3):455–466, 1989.
S. Ganesalingam and G. McLachlan. The efficiency of a linear discriminant function based on
unclassified initial samples. Biometrika, 65:658–662, 1978.
S. Ganesalingam and G. McLachlan. Small sample results for a linear discriminant function
estimated from a mixture of normal populations. Journal of Statistical Computation and
Simulation, 9:151–158, 1979.
A. Garg and D. Roth. Understanding probabilistic classifiers. In Proceedings of the 12th European
Conference on Machine Learning, pages 179–191, 2001.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma.
Neural Computation, 4(1):1–58, 1992.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration
of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–742, 1984.
R. Ghani. Combining labeled and unlabeled data for text classification with a large number of
categories. In Proceedings of the IEEE International Conference on Data Mining, 2001.
R. Ghani. Combining labeled and unlabeled data for multiclass text categorization. In Proceedings
of the International Conference on Machine Learning, 2002.
E. Giné and A. Guillou. Rates of strong uniform consistency for multivariate kernel density
estimators. Annales de l’Institut Henri Poincaré (B) Probability and Statistics, 38(6):907–921,
November 2002.
L. Goldstein and K. Messer. Optimal plug-in estimators for nonparametric functional estimation.
Annals of Statistics, 20(3):1306–1328, 1992.
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press,
Baltimore, 3rd edition, 1996.
C. Goutte, H. Déjean, E. Gaussier, J.-M. Renders, and N. Cancedda. Combining labelled and
unlabelled data: A case study on Fisher kernels and transductive inference for biological entity
recognition. In Conference on Natural Language Learning, 2002.
T. Graepel, R. Herbrich, and K. Obermayer. Bayesian transduction. In Advances in Neural
Information Processing Systems, volume 12, 2000.
Y. Grandvalet. Logistic regression for partial labels. In Ninth Information Processing and
Management of Uncertainty in Knowledge-based Systems, pages 1935–1941, 2002.
Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in
Neural Information Processing Systems, volume 17, 2004.
A. G. Gray and A. W. Moore. N-Body problems in statistical learning. In T. K. Leen, T. G.
Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages
521–527, Cambridge, MA, 2001. MIT Press.
M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to
evaluate sequence matching. Computers and Chemistry, 20(1):25–33, 1996.
L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE
Transactions on Computer-Aided Design, 11:1074–1085, 1992.
J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of
manifolds. In Proceedings of the Twenty-first International Conference on Machine Learning,
pages 369–376, Banff, Canada, 2004.
J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished
manuscript, 1971.
J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology, 143:29–36, 1982.
W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models.
Springer-Verlag, Berlin, 2004. URL http://www.xplore-stat.de/ebooks/ebooks.html.
R. Hardt and F. H. Lin. Mappings minimizing the Lp norm of the gradient. Communications on
Pure and Applied Mathematics, 40:556–588, 1987.
H. O. Hartley and J. N. K. Rao. Classification and estimation in analysis of variance problems.
Review of International Statistical Institute, 36:141–147, 1968.
T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, New York, 1990.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series
in Statistics. Springer-Verlag, New York, 2001.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
University of California, Santa Cruz, Santa Cruz, CA, July 1999.
D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks
for inference, collaborative filtering, and data visualization. Journal of Machine Learning
Research, 1:49–75, 2001.
M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds - weak and strong
pointwise consistency of graph Laplacians. In Proceedings of the Eighteenth Conference on
Learning Theory, pages 470–485, 2005.
M. Hein and Y. Audibert. Intrinsic dimensionality estimation of submanifolds in R^d. Proceedings
of the Twenty-second International Conference on Machine Learning, pages 289–296, 2005.
J. Heinonen, T. Kilpeläinen, and O. Martio. Nonlinear Potential Theory of Degenerate Elliptic
Equations. Oxford University Press, Oxford, 1993.
C. Helmberg. Semidefinite programming for combinatorial optimization. Habilitationsschrift ZIB-
Report ZR-00-34, TU Berlin, Konrad-Zuse-Zentrum Berlin, 2000.
H. Hishigaki, K. Nakai, T. Ono, A. Tanigaki, and T. Takagi. Assessment of prediction accuracy
of protein function from protein-protein interaction data. Yeast, 18:523–531, 2001.
D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathe-
matics of Operations Research, 10(2):180–184, 1985.
T. Hofmann and J. Puzicha. Statistical models for co-occurrence data. Technical Report AI Memo
1625, Artificial Intelligence Laboratory, MIT, Cambridge, MA, February 1998.
R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, UK,
1985.
D. W. Hosmer. A comparison of iterative maximum likelihood estimates of the parameters of
a mixture of two normal distributions under three different types of sample. Biometrics, 29:
761–770, December 1973.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In
Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics and Probability, pages
221–233. University of California Press, Berkeley, 1967.
E. Ie, J. Weston, W. S. Noble, and C. Leslie. Multi-class protein fold recognition using adaptive
codes. In Proceedings of the International Conference on Machine Learning, 2005.
J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai. Revealing modular
organization in the yeast transcriptional network. Nature Genetics, 31:370–377, 2002.
T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote
protein homologies. Journal of Computational Biology, 7(1-2):95–114, 2000.
T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In
Advances in Neural Information Processing Systems 11, pages 487–493, Cambridge, MA, 1999.
MIT Press.
T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research,
5:819–844, 2004.
R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information
Processing Systems 15, Cambridge, MA, 2003. MIT Press.
T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization.
In Machine Learning: Proceedings of the Fourteenth International Conference, pages 143–151,
1997. URL ftp://ftp.cs.cmu.edu/afs/cs/user/thorsten/www/icml97.ps.Z.
T. Joachims. Text categorization with support vector machines: Learning with many relevant
features. In Tenth European Conference on Machine Learning, pages 137–142, 1998.
T. Joachims. Transductive inference for text classification using support vector machines. In Pro-
ceedings of the Sixteenth International Conference on Machine Learning, pages 200–209, Bled,
D. Miller and H. Uyar. A mixture of experts classifier with learning based on both labelled
and unlabelled data. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural
Information Processing Systems 9, pages 571–577, Cambridge, MA, 1997. MIT Press.
T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis,
Massachusetts Institute of Technology, Cambridge, MA, 2001.
T. Mitchell. Machine Learning. McGraw Hill, New York, 1997.
G. D. Murray and D. M. Titterington. Estimation problems with data from a mixture. Applied
Statistics, 27(3):325–334, 1978.
A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification
of proteins database for the investigation of sequences and structures. Journal of Molecular
Biology, 247:536–540, 1995.
E. A. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9:141–142,
1964.
E. A. Nadaraya. Nonparametric Estimation of Probability Densities and Regression Curves.
Kluwer, Dordrecht, the Netherlands, 1989.
R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse,
and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368,
Cambridge, MA, 1998. MIT Press.
S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical
Report CUCS-006-96, Columbia University, New York, February 1996.
Y. Nesterov and A. Nemirovsky. Interior-point polynomial methods in convex programming:
Theory and applications. SIAM, 13, 1994.
A. Y. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic
regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors,
Advances in Neural Information Processing Systems, volume 14, pages 841–848, Cambridge,
MA, 2001. MIT Press.
A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In T. G.
Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing
Systems 14, Cambridge, MA, 2002. MIT Press.
K. Nigam. Using unlabeled data to improve text classification. Technical Report doctoral
dissertation, CMU-CS-01-126, Carnegie Mellon University, Pittsburgh, 2001.
K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Ninth
International Conference on Information and Knowledge Management, pages 86–93, 2000.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled
and unlabeled documents. In Proceedings of the Fifteenth National Conference on Artificial
Intelligence, pages 792–799, 1998.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled
documents using EM. Machine Learning, 39(2/3):103–134, 2000.
A. O’Hagan. Some Bayesian numerical analysis. In J. M. Bernardo, J. O. Berger, A. P. Dawid,
and A. F. M. Smith, editors, Bayesian Statistics 4, pages 345–363, Valencia, 1992. Oxford
University Press.
T. O’Neill. Normal discrimination with unclassified observations. Journal of the American
Statistical Association, 73(364):821–826, 1978.
D. Opitz and J. Shavlik. Generating accurate and diverse members of a neural-network ensemble.
In Advances in Neural Information Processing Systems 8, 1996.
M. Ouimet and Y. Bengio. Greedy spectral embedding. In Proceedings of the Tenth International
Workshop on Artificial Intelligence and Statistics, 2005.
A. Papoulis and S. U. Pillai. Probability, Random Variables and Stochastic Processes. McGraw-
Hill, New York, 4th edition, 2001.
J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia. Sequence
comparisons using multiple sequences detect twice as many remote homologues as pairwise
methods. Journal of Molecular Biology, 284(4):1201–1210, 1998.
S. Park and B. Zhang. Large scale unstructured document classification using unlabeled data and
syntactic information. In Pacific-Asia Conference on Knowledge Discovery and Data Mining,
LNCS vol. 2637, pages 88–99. Springer-Verlag, 2003.
D. Schuurmans and F. Southey. Metric-based methods for adaptive model selection and regular-
ization. Special issue on new methods for model selection and model combination. Machine
Learning, 48(1-3):51–84, 2002.
D. Schuurmans, L. Ungar, and D. Foster. Characterizing the generalization performance of model
selection strategies. In Proceedings of International Conference on Machine Learning, pages
340–348, 1997.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
B. Schwikowski, P. Uetz, and S. Fields. A network of protein-protein interactions in yeast. Nature
Biotechnology, 18:1257–1261, 2000.
H. J. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE
Transactions on Information Theory, 11:363–371, 1965.
M. Seeger. Input-dependent regularization of conditional density models, 2000a. Technical Report,
Institute for ANC, Edinburgh, UK. See www.kyb.tuebingen.mpg.de/bs/people/seeger.
M. Seeger. Learning with labeled and unlabeled data, 2000b. Technical Report, Institute for
ANC, Edinburgh, UK. See www.kyb.tuebingen.mpg.de/bs/people/seeger.
M. Seeger. Covariance kernels from Bayesian generative models. In T. G. Dietterich, S. Becker,
and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages
905–912, Cambridge, MA, 2002. MIT Press.
E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module
networks: Identifying regulatory modules and their condition specific regulators from gene
expression data. Nature Genetics, 34(2):166–176, 2003a.
E. Segal, H. Wang, and D. Koller. Discovering molecular pathways from protein interaction and
gene expression data. Bioinformatics, 19:i264–i272, July 2003b.
J. A. Sethian. Level Set Methods and Fast Marching Methods. Cambridge University Press,
Cambridge, UK, 1999.
F. Sha and L. K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality
reduction. In Proceedings of the Twenty-second International Conference on Machine Learning,
Bonn, Germany, 2005.
B. Shahshahani and D. Landgrebe. The effect of unlabeled samples in reducing the small sample
size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and
Remote Sensing, 32(5):1087–1095, September 1994. URL http://dynamo.ecn.purdue.edu/
~landgreb/GRS94.pdf.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization
over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940,
1998.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University
Press, Cambridge, UK, 2004.
N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models with
EM using equivalence constraints. In Advances in Neural Information Processing Systems 16,
pages 465–472, 2004.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(8):888–905, 2000.
R. Shibata. An optimal selection of regression variables. Biometrika, 68:45–54, 1981.
V. Sindhwani. Kernel machines for semi-supervised learning, 2004. Technical Report, master's
thesis, University of Chicago.
V. Sindhwani, W. Chu, and S. S. Keerthi. Semi-supervised Gaussian processes, 2006. Technical
Report, Yahoo! Research.
V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: From transductive to semi-
supervised learning. In Proceedings of the International Conference on Machine Learning,
2005.
T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of
Molecular Biology, 147:195–197, 1981.
A. Smola and R. Kondor. Kernels and regularization on graphs. In Conference on Learning
Theory, 2003.
P. Sollich. Probabilistic interpretation and Bayesian methods for support vector machines. In
494 REFERENCES
A. L. Yuille, P. Stolorz, and J. Utans. Statistical physics, mixtures of distributions, and the EM
algorithm. Neural Computation, 6(2):334–340, 1994.
S. Zelikovitz and H. Hirsh. Improving short-text classification using unlabeled background
knowledge to assess document similarity. In Proceedings of the Seventeenth International
Conference on Machine Learning, 2000.
T. Zhang and F. Oles. A probability analysis on the value of unlabeled data for classification
problems. In International Joint Conference on Machine Learning, pages 1191–1198, 2000.
Y. Zhang, M. Brady, and S. Smith. Hidden Markov random field model and segmentation of brain
MR images. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.
Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction by local tangent
space alignment. SIAM Journal of Scientific Computing, 26(1):313–338, 2004.
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global
consistency. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information
Processing Systems 16, pages 321–328. MIT Press, Cambridge, MA, 2004.
D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed
graph. In L. De Raedt and S. Wrobel, editors, Proceedings of the Twenty-second International
Conference on Machine Learning, 2005a.
D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems
18, pages 1633–1640, Cambridge, MA, 2005b. MIT Press.
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation.
Technical Report CMU-CALD-02-107, Carnegie Mellon University, Pittsburgh, 2002.
X. Zhu, Z. Ghahramani, and J. Lafferty. Combining active learning and semi-supervised learning
using Gaussian fields and harmonic functions. In ICML-2003 Workshop on the Continuum
from Labeled to Unlabeled Data in Machine Learning, pages 912–912, Washington, DC, 2003a.
X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and
harmonic functions. In Twentieth International Conference on Machine Learning, pages 912–
912, Washington, DC, 2003b. AAAI Press.
X. Zhu, J. Lafferty, and Z. Ghahramani. Semi-supervised learning: From Gaussian fields to
Gaussian processes. Technical Report CMU-CS-03-175, Carnegie Mellon University, Pittsburgh,
2003c.
Notation and Symbols
Sets of Numbers
N the set of natural numbers, N = {1, 2, . . . }
R the set of reals
[n] compact notation for {1, . . . , n}
x ∈ [a, b] interval a ≤ x ≤ b
x ∈ (a, b] interval a < x ≤ b
x ∈ (a, b) interval a < x < b
|C| cardinality of a set C (for finite sets, the number of elements)
Data
X the input domain
d (used if X is a vector space) dimension of X
M number of classes (for classification)
l, u number of labeled, unlabeled training examples
n total number of examples, n = l + u.
i, j indices, often running over [l] or [n]
x_i input patterns x_i ∈ X
y_i classes y_i ∈ [M] (for regression: target values y_i ∈ R)
X a sample of input patterns, X = (x_1, . . . , x_n)
Y a sample of output targets, Y = (y_1, . . . , y_n)
X_l labeled part of X, X_l = (x_1, . . . , x_l)
Y_l labeled part of Y, Y_l = (y_1, . . . , y_l)
X_u unlabeled part of X, X_u = (x_{l+1}, . . . , x_{l+u})
Y_u unlabeled part of Y, Y_u = (y_{l+1}, . . . , y_{l+u})
Kernels
H feature space induced by a kernel
Φ feature map, Φ : X → H
k (positive definite) kernel
K kernel matrix or Gram matrix, K_ij = k(x_i, x_j)
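As a small illustration of the kernel notation above, the following sketch builds a Gram matrix K with K_ij = k(x_i, x_j). The Gaussian (RBF) kernel, its bandwidth parameter `sigma`, the sample values, and the function names `rbf_kernel` and `gram_matrix` are assumptions made for this example only; the notation does not prescribe any particular kernel.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(X, k):
    """Return K with K[i][j] = k(X[i], X[j]) for a sample X = (x_1, ..., x_n)."""
    return [[k(xi, xj) for xj in X] for xi in X]

# Hypothetical sample with n = 3 points in d = 2 dimensions.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(X, rbf_kernel)
```

For any symmetric kernel the resulting matrix is symmetric, and for this particular kernel every diagonal entry equals 1.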
Functions
ln logarithm to base e
log2 logarithm to base 2
f a function, often from X or [n] to R, R^M or [M]
F a family of functions
L_p(X) function spaces, 1 ≤ p ≤ ∞
Probability
P{·} probability of a logical formula
P(C) probability of a set (event) C
p(x) density evaluated at x ∈ X
E [·] expectation of a random variable
Var [·] variance of a random variable
N(μ, σ²) normal distribution with mean μ and variance σ²
Graphs
g graph g = (V, E) with nodes V and edges E
G set of graphs
W weighted adjacency matrix of a graph (W_ij ≠ 0 ⇔ (i, j) ∈ E)
D (diagonal) degree matrix of a graph, D_ii = Σ_j W_ij
ℒ normalized graph Laplacian, ℒ = I − D^{−1/2} W D^{−1/2}
L unnormalized graph Laplacian, L = D − W
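The graph quantities listed above can be computed directly from a weighted adjacency matrix W: the degrees D_ii = Σ_j W_ij, the unnormalized Laplacian D − W, and the normalized Laplacian I − D^{−1/2} W D^{−1/2}. The sketch below uses plain Python lists; the function names and the three-node example graph are hypothetical, and the normalized version assumes every node has strictly positive degree.

```python
import math

def degree_vector(W):
    """Diagonal of the degree matrix: d[i] = sum_j W[i][j]."""
    return [sum(row) for row in W]

def unnormalized_laplacian(W):
    """L = D - W."""
    d = degree_vector(W)
    n = len(W)
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def normalized_laplacian(W):
    """I - D^{-1/2} W D^{-1/2}; assumes all degrees are strictly positive."""
    d = degree_vector(W)
    n = len(W)
    s = [1.0 / math.sqrt(di) for di in d]
    return [[(1.0 if i == j else 0.0) - s[i] * W[i][j] * s[j] for j in range(n)]
            for i in range(n)]

# Hypothetical undirected path graph on three nodes with edge weights 1 and 2.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
L = unnormalized_laplacian(W)
L_norm = normalized_laplacian(W)
```

Each row of the unnormalized Laplacian sums to zero, which gives a quick sanity check on any implementation.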
SVM-related
ρ_f(x, y) margin of function f on the example (x, y), i.e., y · f(x)
ρ_f margin of f on the training set, i.e., min_{i=1,...,m} ρ_f(x_i, y_i)
h VC dimension
C regularization parameter in front of the empirical risk term
λ regularization parameter in front of the regularizer
w weight vector
b constant offset (or threshold)
α_i Lagrange multiplier or expansion coefficient
β_i Lagrange multiplier
α, β vectors of Lagrange multipliers
ξ_i slack variables
ξ vector of all slack variables
Q Hessian of a quadratic program
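To make the margin notation concrete, the sketch below evaluates the margin y · f(x) of a single example and the training-set margin as the minimum over all examples. It assumes a linear function f(x) = ⟨w, x⟩ + b and binary labels in {−1, +1}; these choices, the helper names, and the data values are hypothetical and serve only as an illustration.

```python
def linear_f(w, b):
    """Return f(x) = <w, x> + b for a weight vector w and offset b."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return f

def margin(f, x, y):
    """Margin of f on one example (x, y): y * f(x), with y in {-1, +1}."""
    return y * f(x)

def training_margin(f, X, Y):
    """Margin of f on the training set: the minimum over all examples."""
    return min(margin(f, x, y) for x, y in zip(X, Y))

# Hypothetical weight vector, offset, and three labeled examples.
f = linear_f(w=(1.0, -1.0), b=0.5)
X = [(2.0, 0.0), (0.0, 2.0), (1.0, 0.0)]
Y = [+1, -1, +1]
rho = training_margin(f, X, Y)
```

A positive training-set margin means that every example is classified correctly by the sign of f.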
Miscellaneous
I_A characteristic (or indicator) function on a set A,
i.e., I_A(x) = 1 if x ∈ A and 0 otherwise
δ_ij Kronecker δ (δ_ij = 1 if i = j, 0 otherwise)
δ_x Dirac δ, satisfying ∫ δ_x(y) f(y) dy = f(x)
O(g(n)) a function f (n) is said to be O(g(n)) if there exist constants C > 0
and n0 ∈ N such that |f (n)| ≤ Cg(n) for all n ≥ n0
o(g(n)) a function f(n) is said to be o(g(n)) if for every constant c > 0
there exists n0 ∈ N such that |f(n)| ≤ cg(n) for all n ≥ n0
rhs/lhs shorthand for “right-/left-hand side”
the end of a proof
Contributors
Maria-Florina Balcan
Computer Science Department
Carnegie Mellon University
Arindam Banerjee
Department of Computer Science and Engineering
University of Minnesota
Sugato Basu
Department of Computer Sciences
University of Texas at Austin
Mikhail Belkin
Department of Computer Science and Engineering
Ohio State University
Yoshua Bengio
Département d’Informatique et Recherche Opérationnelle
Université de Montréal
Mikhail Bilenko
Department of Computer Sciences
University of Texas at Austin
Avrim Blum
Computer Science Department
Carnegie Mellon University
Christopher J. C. Burges
Text Mining, Search and Navigation Group
Microsoft Research
Ira Cohen
Enterprise Systems and Software Lab
HP Labs
Adrian Corduneanu
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Fabio G. Cozman
Engineering School
University of Sao Paulo
Nello Cristianini
Department of Engineering Mathematics
University of Bristol
Tijl De Bie
OKP Research Group
K.U.Leuven
Olivier Delalleau
Département d’Informatique et Recherche Opérationnelle
Université de Montréal
Zoubin Ghahramani
Department of Engineering
University of Cambridge
Yves Grandvalet
Heudiasyc
Université de Technologie de Compiègne
Yuhong Guo
Department of Computing Science
University of Alberta
Jihun Ham
Department of Electrical and Systems Engineering
University of Pennsylvania
Eugene Ie
Department of Computer Science and Engineering
University of California at San Diego
Tommi Jaakkola
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Thorsten Joachims
Department of Computer Science
Cornell University
Michael I. Jordan
Department of Statistics
Department of Electrical Engineering and Computer Science
University of California at Berkeley
Jaz Kandola
Gatsby Computational Neuroscience Unit
University College London
John Lafferty
Computer Science Department
Carnegie Mellon University
Neil D. Lawrence
Department of Computer Science
University of Sheffield
Nicolas Le Roux
Département d’Informatique et Recherche Opérationnelle
Université de Montréal
Daniel D. Lee
Department of Electrical and Systems Engineering
University of Pennsylvania
Christina Leslie
Center for Computational Learning Systems
Columbia University
Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst
Tom Mitchell
Machine Learning Department
Carnegie Mellon University
Raymond Mooney
Department of Computer Sciences
University of Texas at Austin
Kamal Nigam
Google
Partha Niyogi
Department of Computer Science
University of Chicago
Alon Orlitsky
Department of Electrical and Computer Engineering
University of California at San Diego
John C. Platt
Knowledge Tools Group
Microsoft Research
Sajama
Department of Electrical and Computer Engineering
University of California at San Diego
Lawrence K. Saul
Department of Computer and Information Science
University of Pennsylvania
Dale Schuurmans
Department of Computing Science
University of Alberta
Bernhard Schölkopf
Department of Empirical Inference
Max Planck Institute for Biological Cybernetics
Matthias Seeger
Department of Empirical Inference
Max Planck Institute for Biological Cybernetics
Fei Sha
Department of Computer and Information Science
University of Pennsylvania
Vikas Sindhwani
Department of Computer Science
University of Chicago
Finnegan Southey
Department of Computing Science
University of Alberta
Koji Tsuda
Department of Empirical Inference
Max Planck Institute for Biological Cybernetics
Vladimir Vapnik
NEC Laboratories America
Kilian Q. Weinberger
Department of Computer and Information Science
University of Pennsylvania
Jason Weston
NEC Laboratories America
Dana Wilkinson
School of Computer Science
University of Waterloo
Dengyong Zhou
NEC Laboratories America
Xiaojin Zhu
Department of Computer Science
University of Wisconsin-Madison
Index