Title: Neyman-Pearson (NP) Classification Algorithms and NP Receiver Operating Characteristic (NP-ROC) Curves

Description: In many binary classification applications, such as disease diagnosis and spam detection, practitioners commonly face the need to limit type I error (i.e., the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (i.e., the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, alpha, on the type I error. Although the NP paradigm has a century-long history in hypothesis testing, it has not been well recognized and implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than alpha do not satisfy the type I error control objective, because the resulting classifiers are still likely to have type I errors much larger than alpha. As a result, the NP paradigm has not been properly implemented for many classification scenarios in practice. In this work, we develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, including popular methods such as logistic regression, support vector machines and random forests. Powered by this umbrella algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular receiver operating characteristic (ROC) curves. NP-ROC bands help compare NP classifiers and choose among them in a data-adaptive way.

Authors: Yang Feng [aut, cre], Jessica Li [aut], Xin Tong [aut], Ye Tian [ctb]
Maintainer: Yang Feng <[email protected]>
License: GPL-2
Version: 2.1.5
Built: 2025-02-12 04:56:38 UTC
Source: https://github.com/cran/nproc
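The point about empirical type I error control can be made concrete with a minimal sketch (not from the package; it assumes class 0 scores follow a standard normal distribution). Thresholding at the empirical alpha-quantile of the class 0 scores leaves the population type I error above alpha in roughly half of the repetitions, far more often than a small tolerance delta would allow:

# Hypothetical illustration: naive empirical thresholding does not control
# the population type I error (assumes class 0 scores ~ N(0, 1)).
set.seed(1)
alpha <- 0.05
n0 <- 100                          # class 0 sample size
typeI <- replicate(1000, {
  s0 <- rnorm(n0)                  # class 0 scores on one training sample
  thr <- quantile(s0, 1 - alpha)   # threshold at the empirical alpha-quantile
  1 - pnorm(thr)                   # true type I error of this threshold
})
mean(typeI > alpha)                # roughly 0.5, nowhere near a small delta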
compare
compares two NP classification methods (given as nproc objects) and identifies the regions of type I error upper bounds where one method is significantly better than the other.
compare(roc1, roc2, plot = TRUE, col1 = "black", col2 = "red")
roc1: the first nproc object.
roc2: the second nproc object.
plot: whether to generate the two NP-ROC plots and mark the regions of significant difference. Default = TRUE.
col1: the color of the region where roc1 is significantly better than roc2. Default = 'black'.
col2: the color of the region where roc2 is significantly better than roc1. Default = 'red'.
A list with the following items.
alpha1: the alpha values where roc1 is significantly better than roc2.
alpha2: the alpha values where roc2 is significantly better than roc1.
alpha3: the alpha values where roc1 and roc2 are not significantly different.
Xin Tong, Yang Feng, and Jingyi Jessica Li (2018), Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC), Science Advances, 4, 2, eaao1659.
See also npc, nproc, predict.npc and plot.nproc.
n = 1000
set.seed(1)
x1 = c(rnorm(n), rnorm(n) + 1)
x2 = c(rnorm(n), rnorm(n)*sqrt(6) + 1)
y = c(rep(0,n), rep(1,n))
fit1 = nproc(x1, y, method = 'lda')
fit2 = nproc(x2, y, method = 'lda')
v = compare(fit1, fit2)
legend('topleft', legend = c('x1','x2'), col = 1:2, lty = c(1,1))
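The returned regions can also be inspected directly; a short sketch, reusing fit1 and fit2 from the example above:

# Inspect the significance regions returned by compare (sketch; reuses fit1, fit2)
v <- compare(fit1, fit2, plot = FALSE)
length(v$alpha1)  # alpha values where fit1 (x1) is significantly better
length(v$alpha2)  # alpha values where fit2 (x2) is significantly better
length(v$alpha3)  # alpha values with no significant difference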
Add NP-ROC curves to the current plot object.
## S3 method for class 'nproc'
lines(x, ...)
x: a fitted nproc object, as returned by nproc.
...: additional arguments.
See also npc, nproc and plot.nproc.
n = 1000
x = matrix(rnorm(n*2), n, 2)
c = 1 + 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
fit = nproc(x, y, method = 'nb')
plot(fit)
fit2 = nproc(x, y, method = 'lda')
lines(fit2, col = 2)
Given a type I error upper bound alpha and a violation upper bound delta, npc calculates the Neyman-Pearson classifier, which controls the type I error below alpha with probability at least 1 - delta.
npc(x = NULL, y, method = c("logistic", "penlog", "svm", "randomforest",
  "lda", "slda", "nb", "nnb", "ada", "tree"), alpha = 0.05, delta = 0.05,
  split = 1, split.ratio = 0.5, n.cores = 1, band = FALSE, nfolds = 10,
  randSeed = 0, warning = TRUE, ...)
x: n * p observation matrix; n observations, p covariates.
y: n 0/1 observations.
method: base classification method. Options are 'logistic' (logistic regression), 'penlog' (penalized logistic regression with LASSO penalty), 'svm' (support vector machine), 'randomforest' (random forest), 'lda' (linear discriminant analysis), 'slda' (sparse LDA), 'nb' (naive Bayes), 'nnb' (nonparametric naive Bayes), 'ada' (AdaBoost) and 'tree' (classification tree). Default = 'logistic'.
alpha: the desired upper bound on the type I error. Default = 0.05.
delta: the violation rate of the type I error. Default = 0.05.
split: the number of splits of the class 0 sample. Default = 1. For the ensemble version, choose split > 1.
split.ratio: the fraction of the class 0 sample used to train the base classifier; the rest is used to estimate the threshold. Can also be set to 'adaptive', in which case the ratio is chosen by a data-driven procedure (see nfolds). Default = 0.5.
n.cores: number of cores used for parallel computing. Default = 1. WARNING: not supported on Windows machines.
band: whether to generate both lower and upper bounds on the type II error. Default = FALSE.
nfolds: number of folds for the adaptive split ratio selection. Default = 10.
randSeed: the random seed used in the algorithm. Default = 0.
warning: whether to show various warnings in the program. Default = TRUE.
...: additional arguments.
An object with S3 class npc.
fits: a list of length max(1, split), containing the fit from each split.
method: the base classification method.
split: the number of splits used.
Xin Tong, Yang Feng, and Jingyi Jessica Li (2018), Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC), Science Advances, 4, 2, eaao1659.
See also nproc and predict.npc.
set.seed(1)
n = 1000
x = matrix(rnorm(n*2), n, 2)
c = 1 + 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
xtest = matrix(rnorm(n*2), n, 2)
ctest = 1 + 3*xtest[,1]
ytest = rbinom(n, 1, 1/(1+exp(-ctest)))

## Use the lda classifier and the default type I error control with alpha = 0.05, delta = 0.05
fit = npc(x, y, method = 'lda')
pred = predict(fit, xtest)
fit.score = predict(fit, x)
accuracy = mean(pred$pred.label == ytest)
cat('Overall Accuracy: ', accuracy, '\n')
ind0 = which(ytest == 0)
typeI = mean(pred$pred.label[ind0] != ytest[ind0])  # type I error on the test set
cat('Type I error: ', typeI, '\n')

## Not run:
## Ensembled lda classifier with split = 11, alpha = 0.05, delta = 0.05
fit = npc(x, y, method = 'lda', split = 11)
pred = predict(fit, xtest)
accuracy = mean(pred$pred.label == ytest)
cat('Overall Accuracy: ', accuracy, '\n')
typeI = mean(pred$pred.label[ind0] != ytest[ind0])
cat('Type I error: ', typeI, '\n')

## Now change the method to logistic regression and alpha to 0.1
fit = npc(x, y, method = 'logistic', alpha = 0.1)
pred = predict(fit, xtest)
accuracy = mean(pred$pred.label == ytest)
cat('Overall Accuracy: ', accuracy, '\n')
typeI = mean(pred$pred.label[ind0] != ytest[ind0])
cat('Type I error: ', typeI, '\n')

## Now change the method to adaboost
fit = npc(x, y, method = 'ada', alpha = 0.1)
pred = predict(fit, xtest)
accuracy = mean(pred$pred.label == ytest)
cat('Overall Accuracy: ', accuracy, '\n')
typeI = mean(pred$pred.label[ind0] != ytest[ind0])
cat('Type I error: ', typeI, '\n')

## Now try the adaptive splitting ratio
fit = npc(x, y, method = 'ada', alpha = 0.1, split.ratio = 'adaptive')
pred = predict(fit, xtest)
accuracy = mean(pred$pred.label == ytest)
cat('Overall Accuracy: ', accuracy, '\n')
typeI = mean(pred$pred.label[ind0] != ytest[ind0])
cat('Type I error: ', typeI, '\n')
cat('Splitting ratio:', fit$split.ratio)
## End(Not run)
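The role of delta can also be checked empirically. The following sketch is not part of the package examples; it reuses the simulation setup above and is slow (about 100 npc fits). It refits npc on repeated draws and estimates how often the population type I error, approximated on a large class 0 test sample, exceeds alpha; under the guarantee this should happen in at most about delta of the repetitions:

## Sketch: empirical check of the 1 - delta guarantee (slow; ~100 npc fits)
set.seed(2)
n <- 1000; alpha <- 0.05; delta <- 0.05
viol <- replicate(100, {
  x <- matrix(rnorm(n*2), n, 2)
  y <- rbinom(n, 1, 1/(1 + exp(-(1 + 3*x[,1]))))
  fit <- npc(x, y, method = 'lda', alpha = alpha, delta = delta)
  xt <- matrix(rnorm(20000*2), 20000, 2)                  # large sample to
  yt <- rbinom(20000, 1, 1/(1 + exp(-(1 + 3*xt[,1]))))    # approximate the population
  i0 <- which(yt == 0)
  mean(predict(fit, xt[i0, ])$pred.label != 0) > alpha    # type I error > alpha?
})
mean(viol)  # fraction of violating repetitions; should be at most about delta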
nproc calculates the Neyman-Pearson receiver operating characteristic (NP-ROC) band for a given sequence of type I error values.
nproc(x = NULL, y, method = c("logistic", "penlog", "svm", "randomforest",
  "lda", "nb", "nnb", "ada", "tree"), delta = 0.05, split = 1,
  split.ratio = 0.5, n.cores = 1, randSeed = 0, ...)
x: n * p observation matrix; n observations, p covariates.
y: n 0/1 observations.
method: base classification method(s). Options are 'logistic' (logistic regression), 'penlog' (penalized logistic regression with LASSO penalty), 'svm' (support vector machine), 'randomforest' (random forest), 'lda' (linear discriminant analysis), 'nb' (naive Bayes), 'nnb' (nonparametric naive Bayes), 'ada' (AdaBoost) and 'tree' (classification tree). Default = 'logistic'.
delta: the violation rate of the type I error. Default = 0.05.
split: the number of splits of the class 0 sample. Default = 1. For the ensemble version, choose split > 1.
split.ratio: the fraction of the class 0 sample used to train the classifier. Default = 0.5.
n.cores: number of cores used for parallel computing. Default = 1.
randSeed: the random seed used in the algorithm. Default = 0.
...: additional arguments.
An object with S3 class nproc.
typeI.u: sequence of type I error upper bounds.
typeII.l: sequence of type II error lower bounds.
typeII.u: sequence of type II error upper bounds.
auc.l: the AUC of the lower NP-ROC curve.
auc.u: the AUC of the upper NP-ROC curve.
method: the base classification method implemented.
delta: the violation rate.
Xin Tong, Yang Feng, and Jingyi Jessica Li (2018), Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC), Science Advances, 4, 2, eaao1659.
n = 200
x = matrix(rnorm(n*2), n, 2)
c = 1 - 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
#fit = nproc(x, y, method = 'svm')
fit2 = nproc(x, y, method = 'penlog')
## Plot the nproc curve
plot(fit2)
## Not run:
fit3 = nproc(x, y, method = 'penlog', n.cores = 2)
# In practice, replace 2 by the number of available cores, e.g. parallel::detectCores()
fit4 = nproc(x, y, method = 'penlog', n.cores = parallel::detectCores())
# Confidence nproc curves
fit6 = nproc(x, y, method = 'lda')
plot(fit6)
# nproc ensembled version
fit7 = nproc(x, y, method = 'lda', split = 11)
plot(fit7)
## End(Not run)
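Beyond plot(), the band components can be used directly. A sketch, reusing fit2 from the example above, with field names as documented in the Value section:

# Sketch: work with the NP-ROC band components directly (reuses fit2)
c(fit2$auc.l, fit2$auc.u)                    # AUCs of the lower and upper curves
band.width <- fit2$typeII.u - fit2$typeII.l  # pointwise width of the band
plot(fit2$typeI.u, 1 - fit2$typeII.u, type = 's',        # lower NP-ROC curve
     xlab = 'type I error upper bound', ylab = '1 - type II error')
lines(fit2$typeI.u, 1 - fit2$typeII.l, type = 's', col = 2)  # upper curve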
Plot the nproc band(s).
## S3 method for class 'nproc'
plot(x, ...)
x: a fitted nproc object, as returned by nproc.
...: additional arguments.
n = 1000
x = matrix(rnorm(n*2), n, 2)
c = 1 + 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
fit = nproc(x, y, method = 'lda')
plot(fit)
Predict the outcomes of a set of new observations using a fitted npc object.
## S3 method for class 'npc'
predict(object, newx = NULL, ...)
object: a fitted npc object, as returned by npc.
newx: a set of new observations.
...: additional arguments.
A list containing the predicted label and score.
pred.label: predicted label vector.
pred.score: predicted score vector.
n = 1000
x = matrix(rnorm(n*2), n, 2)
c = 1 + 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
xtest = matrix(rnorm(n*2), n, 2)
ctest = 1 + 3*xtest[,1]
ytest = rbinom(n, 1, 1/(1+exp(-ctest)))
## Not run:
## Use the logistic classifier and the default type I error control with alpha = 0.05
fit = npc(x, y, method = 'logistic')
pred = predict(fit, xtest)
fit.score = predict(fit, x)
accuracy = mean(pred$pred.label == ytest)
cat('Overall Accuracy: ', accuracy, '\n')
ind0 = which(ytest == 0)
ind1 = which(ytest == 1)
typeI = mean(pred$pred.label[ind0] != ytest[ind0])   # type I error on the test set
cat('Type I error: ', typeI, '\n')
typeII = mean(pred$pred.label[ind1] != ytest[ind1])  # type II error on the test set
cat('Type II error: ', typeII, '\n')
## End(Not run)
Print the npc object.
## S3 method for class 'npc'
print(x, ...)
x: a fitted npc object, as returned by npc.
...: additional arguments.
n = 1000
x = matrix(rnorm(n*2), n, 2)
c = 1 + 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
fit = npc(x, y, method = 'lda')
print(fit)
Print the nproc object.
## S3 method for class 'nproc'
print(x, ...)
x: a fitted nproc object, as returned by nproc.
...: additional arguments.
n = 1000
x = matrix(rnorm(n*2), n, 2)
c = 1 + 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
fit = nproc(x, y, method = 'lda')
print(fit)
rocCV calculates the receiver operating characteristic with cross-validation.
rocCV(x = NULL, y, method = c("logistic", "penlog", "svm", "randomforest",
  "lda", "nb", "ada", "tree"), metric = "CV", n.folds = 5,
  train.frac = 0.5, n.cores = 1, randSeed = 0, ...)
x: n * p observation matrix; n observations, p covariates.
y: n 0/1 observations.
method: classification method(s). Options are 'logistic' (logistic regression), 'penlog' (penalized logistic regression with LASSO penalty), 'svm' (support vector machine), 'randomforest' (random forest), 'lda' (linear discriminant analysis), 'nb' (naive Bayes), 'ada' (AdaBoost) and 'tree' (classification tree). Default = 'logistic'.
metric: metric used for averaging performance across repetitions. Options are 'CV' (cross-validation) and 'SS' (subsampling, i.e., repeated random training/test splits). Default = 'CV'.
n.folds: number of folds for cross-validation, or the number of splits in the subsampling. Default = 5.
train.frac: fraction of training data in the subsampling process. Default = 0.5.
n.cores: number of cores used for parallel computing. Default = 1.
randSeed: the random seed used in the algorithm. Default = 0.
...: additional arguments.
A list.
fpr: sequence of false positive rates.
tpr: sequence of true positive rates.
Xin Tong, Yang Feng, and Jingyi Jessica Li (2018), Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC), Science Advances, 4, 2, eaao1659.
n = 200
x = matrix(rnorm(n*2), n, 2)
c = 1 - 3*x[,1]
y = rbinom(n, 1, 1/(1+exp(-c)))
fit = rocCV(x, y, method = 'svm')
fit2 = rocCV(x, y, method = 'penlog')
fit3 = rocCV(x, y, method = 'penlog', metric = 'SS')
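A follow-up sketch (reusing fit and fit2 from above) plots the averaged ROC curves and computes a trapezoidal AUC from the returned fpr/tpr sequences; it assumes fpr is sorted in increasing order:

# Sketch: plot rocCV output and compute a trapezoidal AUC (reuses fit, fit2)
plot(fit$fpr, fit$tpr, type = 'l',
     xlab = 'false positive rate', ylab = 'true positive rate')
lines(fit2$fpr, fit2$tpr, col = 2)
trap.auc <- function(r)                      # assumes r$fpr is increasing
  sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)
c(svm = trap.auc(fit), penlog = trap.auc(fit2))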