We provide a detailed demonstration of how to use the package.
Let $(X, Y)$ be random variables, where $X \in \mathcal{X}$ is a vector of $d$ features and $Y \in \{0, 1\}$ represents a binary class label. A data set containing independent observations $\{(x_i, y_i)\}_{i=1}^{n+m}$ sampled from the joint distribution of $(X, Y)$ is often divided into training data $\{(x_i, y_i)\}_{i=1}^{n}$ and test data $\{(x_i, y_i)\}_{i=n+1}^{n+m}$.
Based on the training data, a classifier $\phi(\cdot)$ is a mapping $\phi: \mathcal{X} \to \{0, 1\}$ that returns the predicted class label given $X$. Classification error occurs when $\phi(X) \neq Y$, and the binary loss is defined as $\mathbf{1}(\phi(X) \neq Y)$, where $\mathbf{1}(\cdot)$ denotes the indicator function. The risk is defined as $R(\phi) = E[\mathbf{1}(\phi(X) \neq Y)] = P(\phi(X) \neq Y)$, which can be expressed as a weighted sum of the type I and type II errors: $R(\phi) = P(Y = 0)R_0(\phi) + P(Y = 1)R_1(\phi)$, where $R_0(\phi) = P(\phi(X) \neq Y \mid Y = 0)$ denotes the type I error and $R_1(\phi) = P(\phi(X) \neq Y \mid Y = 1)$ denotes the type II error.
The classical classification paradigm aims to find an oracle classifier $\phi^*$ by minimizing the risk: $\phi^* = \arg\min_{\phi} R(\phi)$.
In contrast, the NP classification paradigm aims to mimic the NP oracle classifier $\phi^*_\alpha$ with respect to a pre-specified type I error upper bound $\alpha$: $\phi^*_\alpha = \arg\min_{\phi:\ R_0(\phi) \le \alpha} R_1(\phi)$, where $\alpha$ reflects the user's conservative attitude (priority) towards the type I error. In practice, since we do not know the class distribution, we cannot expect to enforce the type I error bound strictly; instead, we wish the type I error to be bounded by $\alpha$ with probability at least $1 - \delta$, where $\delta \in (0, 1)$ is another user-specified hyper-parameter.
The package provides two essential tools for NP classification: (i) a function that adapts popular classification algorithms, such as linear discriminant analysis, logistic regression, support vector machines, and random forests, to the NP paradigm; and (ii) a function that constructs NP-ROC bands, an alternative to the popular ROC curves, which allow visualization and comparison of NP classifiers.
The simplest way to obtain the package is to install it directly from CRAN. Type the following command in R console:
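install.packages("nproc", repos = "http://cran.us.r-project.org")  # the repos value shown is one common CRAN mirror; adjust as preferred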
Users may change the repos option depending on their location and preferences. Other options, such as the directory to which the package is installed, can also be altered in the command. For more details, see help(install.packages).
Here the package has been downloaded and installed to the default directories.
Alternatively, users can download the package source at http://cran.r-project.org/web/packages/nproc/index.html and use standard Unix commands to install it to the desired location.
Then we can load the nproc package:
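library(nproc)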
Here, we demonstrate how to construct a Neyman-Pearson classifier when a user prioritizes control of the type I error. Note that a user can also control the type II error under a specific level by switching the class labels and applying the npc function.
In the first step, we create a dataset (x,y) from a logistic regression model with 2 features and sample size 1000.
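A minimal sketch of such a simulation (the seed and the logistic-model coefficients here are illustrative choices, not necessarily the exact values behind the outputs shown below):

n = 1000
set.seed(1)
x = matrix(rnorm(n * 2), n, 2)           # n observations, 2 features
eta = 1 + 3 * x[, 1]                     # linear predictor with illustrative coefficients
y = rbinom(n, 1, 1 / (1 + exp(-eta)))    # binary labels drawn from the logistic model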
The two classes can be visualized as follows:
plot(x[y == 1, ], col = 1, xlim = c(-4, 4), xlab = 'x1', ylab = 'x2')  # class 1 points
points(x[y == 0, ], col = 2, pch = 2)                                  # overlay class 0 points
legend("topright", legend = c('Class 1', 'Class 0'), col = 1:2, pch = c(1, 2))
Then, we call the npc function to construct Neyman-Pearson classifiers. To use linear discriminant analysis, we set method = "lda". The default type I error upper bound is alpha = 0.05; the alpha value can be changed to any user-specified value in (0, 1). The tolerance upper bound $\delta$ also has a default value of 0.05.
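For instance, the following call fits an lda-based NP classifier with the default levels written out explicitly (a sketch consistent with the text above):

fit = npc(x, y, method = "lda", alpha = 0.05, delta = 0.05)  # delta is the tolerance parameter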
We can now evaluate the prediction performance of the NP classifier fit on a test set (xtest, ytest) generated as follows.
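One way to generate such a test set, mirroring the training simulation sketched above (again, the coefficients are illustrative):

xtest = matrix(rnorm(n * 2), n, 2)
etatest = 1 + 3 * xtest[, 1]
ytest = rbinom(n, 1, 1 / (1 + exp(-etatest)))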
We calculate the overall accuracy of the classifier as well as the realized type I error on the test data. Note that the realized type I error on a single test set may exceed alpha, as it does here. Strictly speaking, to demonstrate the effectiveness of the fit classifier under the NP paradigm, we should repeat this experiment many times and show that in $1 - \delta$ of these repetitions, the type I error (approximated by the empirical type I error on a large test set) is smaller than alpha.
pred = predict(fit, xtest)
fit.score = predict(fit, x)   # scores on the training data
accuracy = mean(pred$pred.label == ytest)
cat("Overall Accuracy: ", accuracy, '\n')
## Overall Accuracy: 0.771
ind0 = which(ytest == 0)
typeI = mean(pred$pred.label[ind0] != ytest[ind0])  # type I error on the test set
cat('Type I error: ', typeI, '\n')
## Type I error: 0.06233766
The base classification methods implemented in the npc function include the following options:

- logistic: logistic regression via the glm function with family = "binomial"
- penlog: penalized logistic regression via glmnet in the glmnet package
- svm: support vector machines via svm in the e1071 package
- randomforest: random forests via randomForest in the randomForest package
- lda: linear discriminant analysis via lda in the MASS package
- nb: naive Bayes via naiveBayes in the e1071 package
- ada: AdaBoost via ada in the ada package

Now, we change the classification method of choice to logistic regression and change alpha to 0.1.
fit = npc(x, y, method = "logistic", alpha = 0.1)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label == ytest)
cat("Overall Accuracy: ", accuracy,'\n')
## Overall Accuracy: 0.823
ind0 = which(ytest == 0)
typeI = mean(pred$pred.label[ind0] != ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')
## Type I error: 0.1168831
To construct each NP classifier, the npc function randomly splits the class 0 sample into two equal halves; the first half, together with the class 1 sample, is used to train the base classification method, and the second half is reserved to find the threshold on the scores. To increase stability, the nproc package provides an implementation of ensemble classifiers, which take a majority vote of the NP classifiers constructed from different random sample splits. The parameter split specifies the number of random splits. In the following code, we change its value to 11.
fit = npc(x, y, method = "logistic", alpha = 0.1, split = 11)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label == ytest)
cat("Overall Accuracy: ", accuracy,'\n')
## Overall Accuracy: 0.806
ind0 = which(ytest == 0)
typeI = mean(pred$pred.label[ind0] != ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')
## Type I error: 0.0961039
Let's compare the performance of different methods on our simulated dataset.
methodlist = c("logistic", "penlog", "randomforest", "svm",
"lda", "nb", "ada")
for (method in methodlist) {
  fit = npc(x, y, method = method, alpha = 0.05)
  pred = predict(fit, xtest)
  accuracy = mean(pred$pred.label == ytest)
  cat(method, ': Overall Accuracy: ', accuracy, '\n')
  ind0 = which(ytest == 0)
  typeI = mean(pred$pred.label[ind0] != ytest[ind0])  # type I error on the test set
  cat(method, ': Type I error: ', typeI, '\n')
}
## logistic : Overall Accuracy: 0.769
## logistic : Type I error: 0.05454545
## penlog : Overall Accuracy: 0.769
## penlog : Type I error: 0.05974026
## randomforest : Overall Accuracy: 0.708
## randomforest : Type I error: 0.05974026
## svm : Overall Accuracy: 0.531
## svm : Type I error: 0.06493506
## lda : Overall Accuracy: 0.771
## lda : Type I error: 0.06233766
## nb : Overall Accuracy: 0.785
## nb : Type I error: 0.07012987
## ada : Overall Accuracy: 0.761
## ada : Type I error: 0.05974026
The function nproc constructs a Neyman-Pearson Receiver Operating Characteristic (NP-ROC) band for an NP classifier. When we plot an NP-ROC band, the horizontal axis is the alpha value, and the vertical axis is 1 − type II error, conditional on the training data. When we section the band vertically at each alpha value, the lower point on the segment is a high-probability lower bound on the (1 − type II error) of the NP classifier corresponding to that alpha value, and the upper point on the segment is a high-probability upper bound on the (1 − type II error) of the NP classifier corresponding to that alpha value.
In the following demo, we use the same data as in the previous section.
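A minimal sketch of the call (the choice of svm as the base method here is illustrative):

fit = nproc(x, y, method = "svm")
plot(fit)   # draw the NP-ROC band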
Note that, to create an NP-ROC band, there is no need to first construct an NP classifier using the npc function.
When we choose linear discriminant analysis as the base method, we do as follows.
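A minimal sketch:

fit = nproc(x, y, method = "lda")
plot(fit)   # NP-ROC band with lda as the base method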
The compare function compares two NP classifiers at different $\alpha$ values (their $\delta$ values have to be the same) by inspecting their NP-ROC bands. The following code displays the NP-ROC bands generated from each of the two predictors. The region marked in red is the alpha range in which the second NP classifier dominates the first, and the black region means the opposite. The uncolored region means we cannot claim superiority of one or the other with confidence.
n = 1000
set.seed(0)
x1 = c(rnorm(n), rnorm(n) + 1)              # predictor 1: mean shift between classes
x2 = c(rnorm(n), rnorm(n) * sqrt(6) + 1)    # predictor 2: mean and variance shift
y = c(rep(0, n), rep(1, n))
fit1 = nproc(x1, y, split = 11, method = 'lda')
fit2 = nproc(x2, y, split = 11, method = 'lda')  # second band, required by compare() below
v = compare(fit1, fit2)   # compare the two NP-ROC bands
legend('topleft', legend = c(expression(X[1]), expression(X[2])), col = 1:2, lty = c(1, 1))
Lastly, we have implemented a web-based interface using the shiny package. It can be launched with the following command after the package is loaded.
shiny::runApp(system.file('nproc', package='nproc'))