---
title: "Simple Use Case for the CBET on classification"
author: "Francisco J. Valverde-Albacete"
date: "Nov, 28th, 2015, modified Jul, 4th, 2018"
output: html_document
---
This vignette aims at presenting the features of the Channel Binary Entropy Triangle (CBET) for the evaluation of supervised classification in an incremental manner.
# Environment construction
```{r, message=F, warning=F, environment}
library(tidyverse) # That (in)famous Mr. Wickham!
library(caret) # To build the classifiers.
library(mlbench) # Many databases for ML tasks
library(vcd) # Categorical benchmarks
library(candisc) # Wine dataset
library(entropies) # Processing and visualizing joint entropies
library(compositions)# Statistics work differently on compositional data
```
Some top level switches and options gathered in one place.
```{r switches}
debugLevel <- 0 # Debug level 0-non-existent, 1-minimal, the greater the more verbose.
fancy <- TRUE # set this for nicer on-screen visualization.
#fancy <- FALSE # Set this for either printing matter or more austere plots.
getPlot <- TRUE # Flag to obtain plots for publication.
getPlot <- FALSE #Comment to get .jpeg files rather than plots of ets.
knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=4)
if (getPlot)
knitr::opts_chunk$set(dev = 'pdf') # plots in pdf, better for publication
```
Some choices for visualization parameters, and primitives.
```{r}
splitShapesForTypes <- c("X"=4, "Y"=1, "XY"=20) #To draw split diagrams
# Naive transformation from factors to numbers in 0 to num.factors - 1
factor.as.numeric <- function(f){
nums <- as.numeric(f)
return(nums - min(nums))
}
```
## Datasets available for entropy analysis in this package
```{r generic dataset initialization}
datasets <- getDatasets()
datasets
if (getPlot){# For papers, it helps to have the table in latex.
library(xtable)
print.xtable(xtable(dplyr::select(datasets, name, K, n, m)))
}
```
```{r dataset choice}
# Uncomment the name of the database to be analyzed
# dsName <- "Ionosphere"
# dsName <- "iris"
# dsName <- "BreastCancer"#Supply the name of the database to be analyzed#FACTORS
# dsName <- "Wine"
# dsName <- "Glass"#Cannot take logarithms for PCA: zeros returns -Inf
dsName <- "Arthritis"#It has non-numeric factors.
# dsName <- "Sonar"
dsRecord <- filter(datasets, name == dsName)
ds <- evalDataset(dsName)
```
We'll use the `r dsName` data throughout this vignette.
# Classifier design
## Basic data from the set for classification
```{r}
#id columns, if existent in dsRecord$idNumber
# log transform of everything but the class and any id if existant.
if (!is.na(dsRecord$idNumber)){
ds <- ds[,-dsRecord$idNumber]
}
#class column
ds.classNum <- which(names(ds)==dsRecord$className)
#take away the class, but keep it just in case.
class.ds <- ds[, ds.classNum]#saving the class. Warning A FACTOR!
ds <- ds[,-ds.classNum]
ds <- ds %>%
#transform factors to number
mutate_if(is.factor,factor.as.numeric) %>%
# Dispose of columns with NaN
select_if(function(v) !any(is.na(v))) %>%
# Dispose of constant columns: they carry no information
select_if(function(v)(var(v) > 0))
ncols <- ncol(ds)#Mnemonic shortcut: num of columns
dsDiscretized <- infotheo::discretize(ds, disc="equalwidth")
if (dsName != "Ionosphere"){
log.ds <- log(ds)#this has to be made conditional on the database
log.dsDiscretized <- infotheo::discretize(log.ds)
#TODO: try to get rid of annoying warnings each time entropy is called.
}
X <- as.matrix(ds)
Y <- class.ds
classes <- unique(Y)
numC <- length(classes)
print(sprintf("%s has %d classes with distribution: ", dsName, numC))
summary(Y)
```
## Design a simple classifier
Throughout this vignette we use a k-nearest neigbour classifier.
In this initial evaluation, we first carry out a basic random partitioning of the data.
```{r random split}
set.seed(2117)
inTrain <- createDataPartition(y=Y,
p=0.80, # Tries to do stratified sampling
list=FALSE)
trainX <- X[inTrain,]; trainY <- Y[inTrain]
testX <- X[-inTrain,]; testY <- Y[-inTrain]
#Basic model fitting
fit <- train(x=trainX, y=trainY,
method="knn",
tuneLength = 15,
preProcess = c("center", "scale"))
```
## Evaluation using the CBET
Evaluation proceeds by obtaining the confusion matrices for the train and test sets...
```{r random split model}
## obtain a training caret::confusion matrix
trCM <- confusionMatrix(predict(fit,trainX), trainY)
#trEntropies <-
trCoords <- jentropies(t(trCM$table))
## prediction and the test confusion matrix
predicted <- predict(fit, testX)
teCM <- confusionMatrix(predicted,testY)
#teEntropies <-
teCoords <- jentropies(t(teCM$table))
```
And then printing the results in the Entropy Triangle for a single classification experiment:
```{r random split CBET}
experiments <- rbind(cbind(trCoords, Phase="train", method="knn"),
cbind(teCoords, Phase="test", method="knn")
)
experiments <- cbind(dSet=dsName, experiments)
# The basic plot for the entropy triangle training and testX in different colours and glyphs
gp <- ggmetern(data=experiments %>% filter(type=="XY"), fancy) +
geom_point(aes(colour=Phase, shape=dSet), size=1) +
labs(shape="Dataset") +
scale_colour_brewer(palette="Set1")
gp
```
Note that, at least for `r dsName` (and *iris*), there is a suspicious behaviour in the plot in that the classifier achieves a better information transfer (correlated with accuracy) in test than in training.
```{r comparing accuracies}
print(sprintf("Training accuracy= %f vs. Testing accuracy=%f ", trCM$overall[1], teCM$overall[1]))
```
This is part of the "evaluation paradox" for classifications: since the test must have a higher variance, there will be instances of train-test partitions where *the performance on the testing set will be higher that on the training set*. This is partially solved with n-fold validation.
# A better picture with n-fold validation
To confirm this intuition and get all the value for our coin in the entropy triangle,
in the following, we use n-fold validation to visualize several experiments and their mean performance.
First we create the folds: the number of folds is a parameter of this script.
```{r n-fold partitioning}
numFolds <- 5
set.seed(1717) # For reproducibility
folds <- createFolds(Y, numFolds)
print("Check that the sampling was stratified...")
for(i in 1:numFolds){
print(summary(Y[folds[[i]]]))
}
summary(Y)
```
Run the experiments
```{r n-fold model validation, warning=FALSE}
models <- c("knn") #c("knn", "logreg")
results <- data.frame()
for (i in 1:numFolds){
for (m in models){
# 1. select training and testX data and classes
trainObs <- unlist(folds[-i])
testObs <- folds[[i]]
trainX <- X[trainObs, ]; trainY <- Y[trainObs]
testX <- X[testObs, ]; testY <- Y[testObs]
# 2. Fit the model with the
model <- train(x=trainX, y=trainY,
method=m,
tuneLength = 15,
preProcess = c("center", "scale"))
# 3. Estimate the labels for the train set: confusion matrix, entropies, etc.
trainYhat <- predict(model, trainX)
trainCM <- confusionMatrix(trainYhat, trainY)
print(trainCM$table)
# 4. Estimate the labels for the test set
testYhat <- predict(model, testX)
testCM <- confusionMatrix(testYhat, testY)
print(testCM$table)
# 5. Gather results for
# CAVEAT: our framework supposes that in confusion matrices, rows are indexed by
# the reference, hence the transposition below
results <- rbind(results,
evaluate(t(trainCM$table)) %>% mutate(Fold=i,method=m, Phase="train",
Acc=trainCM$overall[1]),
evaluate(t(testCM$table)) %>% mutate(Fold=i,method=m, Phase="test",
Acc=testCM$overall[1])
)
print(sprintf("Fold %d, method %s Train accuracy = %f\t Test accuracy= %f",
i, m, trainCM$overall[1],testCM$overall[1])
)
}
}
results <- cbind(dSet=dsName,results)#Watch it! This is only possible at last!
```
We show the plot for the result on a per-plot basis.
```{r detailed n-fold CBET}
eT <- ggmetern(data=results %>% filter(type=="XY"), fancy) +
geom_point(aes(colour=Phase, shape=dSet), size=2) +
labs(shape="Dataset") +
scale_colour_manual(values=c("blue","red")) # Don't trust the training, that is the red
eT
```
Clearly, some test results are better than training results. What about centrality and dispersion measures?
```{r mean of results}
results %>% filter(type=="XY") %>% select(-one_of(c("method"))) %>%
group_by(type, Phase) %>%
summarise(meanAcc=mean(Acc), sdAcc=sd(Acc), meanEMA = mean(EMA), sdEMA = sd(EMA))
```
This agrees with the theory that insists on the variance of the testing instances being higher.
Note that the Entropy-Modified Accuracy (EMA) is an alternative, and more pessimistic, measure than the Accuracy.
## Visualization with the split triangle
To use the split triangle to advantage we have to use an unbalanced dataset, e.g. Glass or Arthritis.
First we look at the entropies of the different folds.
```{r split triangle on one dataset}
eTsplit <- ggmetern(data=results %>% filter(type!="XY"), fancy) +
geom_point(aes(colour=Phase, shape=type), size=2) +
labs(shape="Split Entropies") +
scale_shape_manual(values=splitShapesForTypes) +
scale_colour_manual(values=c("train"="blue","test"="red"))
eTsplit + geom_text(data=results %>% filter(type == "XY"), aes(label=Fold,color=Phase))
#color="blue") size=4, vjust=2, hjust=1)
```
The number of the fold appears in the place where the aggregate entropies appeared in the previous CBET.
We can see that the input entropies $H_{P_X}$ are much more concentrated. This is because the sampling was stratified. While in theory it should obtain similar reference class distributions for all folds, e.g. the "x" above should lie in a line parallel to the left side of the triangle, in practice the stratification is imperfect.
Also the predicted class distributions are much less entropic than the reference class distributions: their uncertainty has decreased, contrary to the data processing inequality. The reason for this is the classifier concentrating on majority classes, i.e. *specializing*, which increases the *Accuracy*.
To better view this phenomenon, we obtain the (compositional) means of the entropy values.
```{r}
meanCompositions <- data.frame()
for(dsNam in unique(results$dSet)){
for(ph in unique(results$Phase)){
for(eType in unique(results$type)){
meanCompositions <-
rbind(meanCompositions,
cbind(dSet=dsNam, Phase=ph, type=eType,
data.frame(as.list(
mean(acomp(results %>%
filter(dSet==dsNam, Phase==ph, type==eType) %>%
select(one_of(c("DeltaH_P", "M_P", "VI_P"))))
)))
)
)
}
}
}
meanCompositions %>% filter(type=="XY")
```
And now we add it with a different glyph but the same colors.
```{r}
eTsplit + geom_point(data=meanCompositions, aes(colour=Phase, shape=type), size=4)
```
This shows that there is very little mutual information transfer. It was made worse by the classifier specializing, that is, increasing $\Delta H_{P_{XY}}$ at the expense of mutual information.
All in all, `r models` is a very bad classifier for `r dsName`.
# Session information
```{r}
sessionInfo()
```