# Multi-Label Classification With R

In machine learning, single-label classification is concerned with learning a model from a set of instances, each associated with exactly one label $l$ from a set of disjoint labels $L$. If the number of labels in $L$ is equal to 2, the learning problem is called $\it{binary}$ classification. If the number of labels is more than 2, it is a $\it{multi\text{-}class}$ classification problem.

But in applications such as text categorization, medical diagnosis, and music categorization, an instance may belong to more than one class. For example:
(1) A movie can simultaneously belong to the action, crime, thriller and drama categories.
(2) In medical diagnosis, a patient may suffer from both diabetes and cancer at the same time.
(3) A text document about a scientific contribution to medical science may belong to both the science and health categories.

These types of problems are $\it{multi\text{-}label}$ classification problems.
Predicting a set of labels for each instance of data, instead of only one label, is known as $\it{multi\text{-}label}$ classification. That is, in multi-label classification each instance is associated with a set of labels $Y\subseteq L$ instead of a single $l \in L$.

### Single-label vs. Multi-label

Let $L$ be a finite set of labels $L = \{ \lambda_j : j = 1, \ldots, n\}$ and $D$ be the set of instances $D = \{(x_i , Y_i ) : i = 1, \ldots, m\}$, where $x_i$ is the vector of features of an instance and $Y_i \subseteq L$ is the subset of labels of the instance $x_i$. The subset $Y_i$ can be represented as a binary vector $Y_i = (y_1 , y_2 , \ldots , y_n)$, where each $y_j \in \{0, 1\}$ and $y_j = 1$ indicates the presence of label $\lambda_j$ in the set of relevant labels for $x_i$. Suppose every instance $x_i$ has $k$ features $f_1, f_2, \ldots, f_k$.

$\newcommand\T{\Rule{0pt}{1em}{0em}}$
Single-Label $\bf{ y \in \{0,1\}}$
\begin{array}{|c|c c c c c|c|}
\hline & f_1 & f_2 & f_3 & f_4 & f_5 & \lambda \T \\\hline
x_1 & 2 \T & 0.1 & 4 & 1.3 & 2 & 1 \\
x_2 & 1 \T & 0.5 & 2 & 1.7 & 0 & 0 \\
x_3 & 3 \T & 0.4 & 1 & 2.1 & 3 & 0 \\
x_4 & 0 \T & 0.2 & 3 & 1.6 & 1 & 1 \\
x_5 & 5 \T & 0.3 & 0 & 1.1 & 2 & 1 \\
x_6 & 4 \T & 0.6 & 6 & 1.5 & 3 & 0 \\\hline
\end{array}
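
For comparison, a multi-label version of the same table might look as follows. This is a reconstruction, not a table from the original text: the label columns are taken from the per-label tables in the Binary Relevance section, and pairing them with the feature values above is an assumption.

```latex
Multi-Label $Y \subseteq \{\lambda_1, \lambda_2, \lambda_3, \lambda_4\}$
\begin{array}{|c|c c c c c|c c c c|}
\hline & f_1 & f_2 & f_3 & f_4 & f_5 & \lambda_1 & \lambda_2 & \lambda_3 & \lambda_4 \\\hline
x_1 & 2 & 0.1 & 4 & 1.3 & 2 & 1 & 0 & 1 & 0 \\
x_2 & 1 & 0.5 & 2 & 1.7 & 0 & 0 & 0 & 0 & 1 \\
x_3 & 3 & 0.4 & 1 & 2.1 & 3 & 0 & 1 & 0 & 0 \\
x_4 & 0 & 0.2 & 3 & 1.6 & 1 & 1 & 0 & 0 & 1 \\
x_5 & 5 & 0.3 & 0 & 1.1 & 2 & 1 & 0 & 0 & 0 \\
x_6 & 4 & 0.6 & 6 & 1.5 & 3 & 0 & 1 & 0 & 0 \\\hline
\end{array}
```

Each row now carries a binary vector $(y_1, \ldots, y_4)$ instead of a single value; for example, $x_1$ has the label set $Y_1 = \{\lambda_1, \lambda_3\}$.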

### Multi-label methods

The multi-label learning approaches can be organized in three main families:
1. the problem transformation methods transform the multi-label learning problem into one or several single-label classification or regression problems,
2. the algorithm adaptation methods extend single-label learning algorithms for the multi-label data,
3. the ensemble methods use ensembles of classifiers either from the problem transformation or the algorithm adaptation approaches.

### Problem Transformation

(1) Binary Relevance (BR) is probably the most popular transformation method. It learns $|L|$ binary classifiers, one for each label.
(2) Classifier Chain (CC) is an extension of BR that also trains one classifier per label, but extends each classifier's training data with the labels of the previous classifiers in the chain, used as new features.
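
The chaining idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are synthetic, logistic regression (`glm`) is used as the base learner purely for convenience, and `trainChain` and `predictChain` are hypothetical helper names invented for this example.

```r
set.seed(42)  # synthetic data, for illustration only
n <- 80
X <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
l1 <- as.numeric(X$f1 + rnorm(n) > 0)
l2 <- as.numeric(X$f2 + l1 + rnorm(n) > 0.5)   # depends on l1
l3 <- as.numeric(X$f1 - l2 + rnorm(n) > 0)     # depends on l2
Y <- data.frame(l1 = l1, l2 = l2, l3 = l3)

# Training: classifier j sees the features plus the TRUE labels 1..(j-1)
trainChain <- function(X, Y) {
  models <- list()
  inputs <- X
  for (label in colnames(Y)) {
    trainSet <- cbind(inputs, target = Y[[label]])
    models[[label]] <- glm(target ~ ., data = trainSet, family = binomial())
    inputs[[label]] <- Y[[label]]  # becomes a feature for later links
  }
  models
}

# Prediction: classifier j sees the features plus the PREDICTED labels 1..(j-1)
predictChain <- function(models, X, threshold = 0.5) {
  inputs <- X
  Z <- matrix(0, nrow = nrow(X), ncol = length(models),
              dimnames = list(NULL, names(models)))
  for (label in names(models)) {
    p <- predict(models[[label]], newdata = inputs, type = "response")
    Z[, label] <- as.numeric(p >= threshold)
    inputs[[label]] <- Z[, label]
  }
  Z
}

models <- trainChain(X, Y)
Z <- predictChain(models, X)
head(Z)
```

The order of the labels in the chain matters, because each classifier can only use the labels that come before it; the order here is simply the column order of `Y`.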

#### Binary Relevance (BR)

Binary Relevance is one of the most popular transformation methods. It learns $n = |L|$ binary classifiers, one for each label. BR transforms the original dataset into $n$ datasets, each of which contains all the instances of the original dataset with a single label $\lambda_j$ as its binary target.

$n = |L|$ separate binary problems (one for each label)

 \begin{array}{|c|c|} \hline X & \lambda_1 \T \\\hline x_1 \T & 1 \\ x_2 \T & 0 \\ x_3 \T & 0 \\ x_4 \T & 1 \\ x_5 \T & 1 \\ x_6 \T & 0 \\\hline \end{array} \begin{array}{|c|c|} \hline X & \lambda_2 \T \\\hline x_1 \T & 0\\ x_2 \T & 0\\ x_3 \T & 1 \\ x_4 \T & 0\\ x_5 \T & 0\\ x_6 \T & 1 \\\hline \end{array} \begin{array}{|c|c|} \hline X & \lambda_3 \T \\\hline x_1 \T & 1\\ x_2 \T & 0\\ x_3 \T & 0 \\ x_4 \T & 0\\ x_5 \T & 0\\ x_6 \T & 0 \\\hline \end{array} \begin{array}{|c|c|} \hline X & \lambda_4 \T \\\hline x_1 \T & 0\\ x_2 \T & 1\\ x_3 \T & 0 \\ x_4 \T & 1\\ x_5 \T & 0\\ x_6 \T & 0 \\\hline \end{array}

Once these datasets are ready, it is easy to train with any off-the-shelf binary classifier.
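
The transformation itself takes only a few lines of R. The sketch below uses the toy values from the tables above; the column names (`lambda1`, `target`, and so on) and the variable `brDatasets` are made up for the example.

```r
# Toy data copied from the tables above: 6 instances, 5 features, 4 labels
X <- data.frame(f1 = c(2, 1, 3, 0, 5, 4),
                f2 = c(0.1, 0.5, 0.4, 0.2, 0.3, 0.6),
                f3 = c(4, 2, 1, 3, 0, 6),
                f4 = c(1.3, 1.7, 2.1, 1.6, 1.1, 1.5),
                f5 = c(2, 0, 3, 1, 2, 3))
Y <- data.frame(lambda1 = c(1, 0, 0, 1, 1, 0),
                lambda2 = c(0, 0, 1, 0, 0, 1),
                lambda3 = c(1, 0, 0, 0, 0, 0),
                lambda4 = c(0, 1, 0, 1, 0, 0))

# Binary Relevance: build one single-label dataset per label.
# Each dataset keeps every instance and all features, plus one 0/1 target.
brDatasets <- lapply(colnames(Y), function(label) {
  cbind(X, target = factor(Y[[label]], levels = c(0, 1)))
})
names(brDatasets) <- colnames(Y)

length(brDatasets)        # one dataset per label
brDatasets[["lambda2"]]   # the binary problem for the second label
```

Storing the target as a factor is a convenience, since many R classifiers expect a nominal class column.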

#### Multi-label Data: Datasets

\begin{array}{|c|c|c|c|c|c|}
\hline Dataset & F & N & T & S & |L| \T \\\hline
Emotions \T & 72 & 592 & 118 & 474 & 6\\
Yeast \T & 103 & 2417 & 483 & 1934 & 14\\
Scene \T & 294 & 2407 & 481 & 1926 & 6 \\
Medical \T & 1449 & 978 & 333 & 645 & 45 \\\hline
\end{array}

Here $F$ is the number of input feature attributes, $N$ the number of examples, $T$ the number of testing examples, $S$ the number of training examples, and $|L|$ the number of labels.

#### R Code

RWeka is an R interface to Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java, with tools for data pre-processing, classification, regression, clustering, association rules, and visualization. We use J48, Weka's implementation of the C4.5 decision tree algorithm (found under weka.classifiers.trees), as the binary classifier for each label.

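    # Load the R interface to Weka
    library(RWeka)

    # Number of input features per dataset (the remaining columns are labels)
    nFeatures <- list()
    nFeatures[["yeast"]] <- 103
    nFeatures[["emotions"]] <- 72
    nFeatures[["scene"]] <- 294
    nFeatures[["medical"]] <- 1449

    # Read the train/test ARFF files and split each into features (X) and labels (Y)
    MultiLabel.Load.arff <- function(dataset, nFeatures)
    {
      trainFile <- paste(".../multilabel/",dataset,"/",dataset,"-train.arff",sep="")
      testFile <- paste(".../multilabel/",dataset,"/",dataset,"-test.arff",sep="")
      trainData <- read.arff(trainFile)
      testData <- read.arff(testFile)
      return(list(trainDataX=trainData[,1:nFeatures], trainDataY=trainData[,-(1:nFeatures)],
                  testDataX=testData[,1:nFeatures], testDataY=testData[,-(1:nFeatures)]))
    }

    dataset <- "scene" # yeast emotions scene medical
    data <- MultiLabel.Load.arff(dataset, nFeatures[[dataset]])
    trainDataX <- data$trainDataX
    trainDataY <- data$trainDataY
    testDataX <- data$testDataX
    testDataY <- data$testDataY

    # One column of predictions per label
    labelNames <- colnames(trainDataY)
    predictions <- matrix(0, nrow=nrow(testDataX), ncol=length(labelNames))
    predictions <- data.frame(predictions)
    colnames(predictions) <- labelNames

    # Binary Relevance: train one J48 tree per label
    for (label in labelNames) {
      cat(label, "\n")
      # J48 needs a nominal class, so make sure the label column is a factor
      y <- trainDataY[c(label)]
      y[[label]] <- factor(y[[label]])
      J48Train <- cbind(y, trainDataX)
      formula <- as.formula(paste(label, "~ ."))
      model <- J48(formula, data=J48Train)
      # predict the class and store it as numeric 0/1 (assumes the levels are "0" and "1")
      pred <- predict(model, newdata=testDataX, type="class")
      predictions[,label] <- as.numeric(as.character(pred))
    } # label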

#### Multi-label Evaluation

The evaluation of methods that learn from multi-label data requires metrics that differ from those employed for single-label data. Given an instance $x_i$ , the resulting set of labels predicted by a multi-label classifier is denoted by $Z_i$.

Hamming loss: the fraction of labels that are predicted incorrectly, relative to the total number of labels, defined as
$$\text{Hamming Loss} = \frac{1}{N}\sum_{i=1}^N \frac{|Y_i \Delta Z_i|}{|L|}$$ where $N$ is the number of evaluated instances and $\Delta$ is the symmetric difference between two sets, which is equivalent to the XOR operator in Boolean logic.
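
As a quick check of the formula, consider a single instance with $|L| = 4$. The true label set is that of $x_1$ in the Binary Relevance tables, $Y_1 = \{\lambda_1, \lambda_3\}$; the prediction $Z_1 = \{\lambda_1, \lambda_2\}$ is hypothetical and chosen only for illustration.

```latex
$$Y_1 \,\Delta\, Z_1 = \{\lambda_2, \lambda_3\}, \qquad
\frac{|Y_1 \,\Delta\, Z_1|}{|L|} = \frac{2}{4} = 0.5$$
```

One label is missed ($\lambda_3$) and one is wrongly predicted ($\lambda_2$), so half of the four label decisions for this instance are wrong. The Hamming loss of a test set is the average of this quantity over all $N$ instances.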

Accuracy: the overlap between the predicted labels and the correct labels (the Jaccard index of the two sets, averaged over instances), defined as
$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^N \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$

Precision: the proportion of predicted labels that are correct, defined as
$$\text{Precision} = \frac{1}{N}\sum_{i=1}^N \frac{|Y_i \cap Z_i|}{|Z_i|}$$

Recall: the proportion of correct labels that are predicted, defined as
$$\text{Recall} = \frac{1}{N}\sum_{i=1}^N \frac{|Y_i \cap Z_i|}{|Y_i|}$$
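
The four metrics can be computed directly from 0/1 matrices. The following is a small sketch on made-up data (three instances, four labels); it assumes every instance has at least one true and one predicted label, since otherwise the ratios would divide by zero.

```r
# Y: true labels, Z: predicted labels; rows are instances, columns are labels
Y <- matrix(c(1, 0, 1, 0,
              0, 1, 0, 0,
              1, 1, 0, 1), nrow = 3, byrow = TRUE)
Z <- matrix(c(1, 1, 0, 0,
              0, 1, 0, 0,
              1, 0, 0, 1), nrow = 3, byrow = TRUE)

interSize <- rowSums(Y == 1 & Z == 1)   # size of the intersection per instance
unionSize <- rowSums(Y == 1 | Z == 1)   # size of the union per instance

hammingLoss <- mean(rowSums(Y != Z) / ncol(Y))
accuracy    <- mean(interSize / unionSize)
precision   <- mean(interSize / rowSums(Z))
recall      <- mean(interSize / rowSums(Y))

print(c(hammingLoss = hammingLoss, accuracy = accuracy,
        precision = precision, recall = recall))
```

For these matrices the Hamming loss is 0.25, accuracy is 2/3, precision is 5/6, and recall is 13/18. Note that a lower Hamming loss is better, while higher is better for the other three.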

    library(pROC)
    THRESHOLD <- 0.5

    # Compute Hamming loss. The predictions are thresholded into 0/1 first.
    #
    # Y is the true label matrix, N * L, where N is # of test instances and L is # of labels
    # Z is the prediction matrix, N * L, with the same layout
    HammingLoss <- function(Y,Z)
    {
      if (nrow(Y) != nrow(Z) || ncol(Y) != ncol(Z)) {
        stop("Dim of Y and Z does not match...")
      }
      nRow <- nrow(Y)
      nCol <- ncol(Y)
      Z[Z < THRESHOLD] <- 0
      Z[Z >= THRESHOLD] <- 1
      # fraction of (instance, label) cells where truth and prediction disagree
      return(sum(Y != Z)/(nRow*nCol))
    }

    # evaluate
    results <- HammingLoss(data$testDataY, predictions)
    print(results)

#### MEKA: A Multi-label Extension to WEKA

Java implementations of multi-label algorithms are available in the MEKA software package. MEKA is a WEKA-based framework for multi-label classification and evaluation. It also serves as a wrapper for MULAN.

MEKA can be used from the command line or a GUI, its classifiers can be used in any ensemble scheme, and it provides many evaluation metrics. Thresholds can be calibrated automatically or, optionally, set ad hoc.
Some multi-label classification algorithms available in MEKA are:
• Binary Relevance (BR)
• Binary Relevance – Random Subspace (BRq)
• Classifier Chains (CC)
• Label Powerset (LP)
• Multi-Label k-Nearest Neighbor (MLkNN)
• Random k-Labelsets (RAkEL).
