Title: | Decision Forest |
---|---|
Description: | Provides an R implementation of the Decision Forest algorithm, which combines the predictions of multiple independent decision tree models into a consensus decision. In particular, Decision Forest is a pattern-recognition method that can be used to analyze: (1) DNA microarray data; (2) Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) data; and (3) Structure-Activity Relationship (SAR) data. Three fundamental functions are provided: (1) DF_train, (2) DF_pred, and (3) DF_CV. Run Dforest() to see more instructions. Weida Tong (2003) <doi:10.1021/ci020058s>. |
Authors: | Leihong Wu <[email protected]>, Weida Tong ([email protected]) |
Maintainer: | Leihong Wu <[email protected]> |
License: | GPL-2 |
Version: | 0.4.2 |
Built: | 2025-03-02 03:18:12 UTC |
Source: | https://github.com/cran/Dforest |
Performance evaluation of results from other modeling algorithms
cal_MCC(pred, label)
pred | Predictions
label | Known endpoint
result$ACC: Prediction accuracy
result$MIS: Misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: Balanced accuracy
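A minimal usage sketch (illustrative, not taken from the package documentation); the two factor vectors below are made up:

pred = factor(c("pos", "pos", "neg", "neg", "pos"))    # hypothetical predictions
label = factor(c("pos", "neg", "neg", "neg", "pos"))   # hypothetical known endpoints
result = cal_MCC(pred, label)
result$ACC   # prediction accuracy
result$MCC   # Matthews correlation coefficient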
Construct Decision Tree model with pruning
Con_DT(X, Y, min_split = 10, cp = 0.01)
X | Dataset
Y | Data labels
min_split | Minimum number of observations required in a node to attempt a split
cp | Pre-defined complexity parameter (cp) passed to the rpart program
Decision tree model with pruning, implemented by rpart
rpart
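An illustrative sketch, assuming X is a data matrix and Y the corresponding class labels; iris serves only as a stand-in data set:

X = iris[, 1:4]
Y = iris[, 5]
tree_model = Con_DT(X, Y, min_split = 10, cp = 0.01)
# The returned object is an rpart model, so standard rpart methods apply, e.g.:
print(tree_model)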
This data set gives the DILI endpoint (Most or No DILI-concern) of various compounds, together with QSAR descriptors generated by MOLD2.
data_dili
A list containing two elements: X, a data matrix of 958 observations and 777 variables, and Y, the DILI endpoints of the 958 observations.
In-house data
Minjun Chen (2011). FDA-approved drug labeling for the study of drug-induced liver injury. Drug Discovery Today.
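An illustrative way to inspect the data; the object name data_dili and the data() call follow the DF_pred example later in this manual:

# data(demo_simple)
dim(data_dili$X)     # expected: 958 observations, 777 variables
table(data_dili$Y)   # DILI endpoints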
Performance evaluation from Decision Tree Predictions
DF_acc(pred, label)
pred | Predictions
label | Known endpoint
result$ACC: Prediction accuracy
result$MIS: Misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: Balanced accuracy
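DF_acc is documented with the same arguments and return values as cal_MCC above; a minimal sketch with made-up factor vectors:

pred = factor(c("DILI", "DILI", "noDILI", "noDILI"))
label = factor(c("DILI", "noDILI", "noDILI", "noDILI"))
result = DF_acc(pred, label)
result$ACC    # prediction accuracy
result$bACC   # balanced accuracy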
T-test for feature selection
DF_calp(X, Y)
X | X variable matrix
Y | Y label
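A usage sketch on a two-class subset of iris; the structure of the return value is not documented above, so it is simply captured here (per-feature t-test results are assumed):

X = iris[iris[, 5] != "setosa", 1:4]
Y = droplevels(iris[iris[, 5] != "setosa", 5])
p_res = DF_calp(X, Y)   # assumed: per-feature t-test p-values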
Draw accuracy curve according to the confidence level of predictions
DF_ConfPlot(Pred_result, Label, bin = 20, plot = T, smooth = F)
Pred_result | Predictions
Label | Known labels for the test data set
bin | Number of bins in the confidence plot (default is 20)
plot | If TRUE, draw the plot; otherwise output the data matrix
smooth | If TRUE, fit the performance curve with a smoothing function (via ggplot2)
ACC_Conf: data matrix ("ConfidenceLevel", "Accuracy", "Matched Samples") for the confidence plot (when plot = FALSE)
ConfPlot: confidence plot (when plot = TRUE); requires ggplot2
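A sketch of a full pipeline feeding DF_ConfPlot (requires ggplot2 for the plot); passing the whole DF_pred() output as Pred_result is an assumption based on the matching argument name:

X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
used_model = DF_train(X, factor(Y))
Pred_result = DF_pred(used_model, X, Y)
DF_ConfPlot(Pred_result, Y, bin = 20, plot = TRUE)     # draw the confidence plot
ACC_Conf = DF_ConfPlot(Pred_result, Y, plot = FALSE)   # data matrix instead of a plot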
Draw accuracy curve according to the confidence level of predictions
DF_ConfPlot_accu(Pred_result, Label, bin = 20, plot = T, smooth = F)
Pred_result | Predictions
Label | Known labels for the test data set
bin | Number of bins in the confidence plot (default is 20)
plot | If TRUE, draw the plot; otherwise output the data matrix
smooth | If TRUE, fit the performance curve with a smoothing function (via ggplot2)
ACC_Conf: data matrix ("ConfidenceLevel", "Accuracy", "Matched Samples") for the confidence plot (when plot = FALSE)
ConfPlot: confidence plot (when plot = TRUE); requires ggplot2
Decision Forest algorithm: model training with cross-validation. Default is 5-fold cross-validation.
DF_CV(X, Y, stop_step = 10, CV_fold = 5, Max_tree = 20, min_split = 10, cp = 0.1, Filter = F, p_val = 0.05, Method = "bACC", Quiet = T, Grace_val = 0.05, imp_accu_val = 0.01, imp_accu_criteria = F)
X | Training data set
Y | Training data endpoint
stop_step | Number of extra steps to process after the performance stops improving; 1 means one extra step
CV_fold | Fold of cross-validation (default = 5)
Max_tree | Maximum number of trees in the forest
min_split | Minimum number of observations required in a node to attempt a split
cp | Complexity parameter used to prune the decision trees; default is 0.1
Filter | If TRUE, perform feature selection before training
p_val | P-value threshold of the t-test used in feature selection; default is 0.05
Method | Metric used to evaluate the training process: MIS (misclassification rate), ACC (accuracy); default is bACC (balanced accuracy)
Quiet | If TRUE (default), do not show messages during the process
Grace_val | Grace value in evaluation: the next model's performance (accuracy, bACC, MCC) may be no worse than the previous model's by more than this threshold
imp_accu_val | Improvement threshold in evaluation: adding a new tree should improve the overall model performance (accuracy, bACC, MCC) by at least this value
imp_accu_criteria | If TRUE, the model must show improvement in accumulated accuracy
.$performance: Overall training accuracy (cross-validation)
.$pred: Detailed training predictions (cross-validation)
.$detail: Detailed usage of decision tree features/models and their performance in all CV folds
.$Method: the evaluation Method used in training (passed through)
.$cp: the cp value used in training the decision trees (passed through)
## data(iris)
X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
random_seq = sample(nrow(X))
split_rate = 3
split_sample = suppressWarnings(split(random_seq, 1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]], ]
Train_Y = Y[-random_seq[split_sample[[1]]]]
CV_result = DF_CV(Train_X, Train_Y)
Draw plot for Dforest Cross-validation results
DF_CVsummary(CV_result, plot = T)
CV_result | Cross-validation result returned by DF_CV
plot | If TRUE (default), draw the plot
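A short sketch; CV_result is a cross-validation result produced by DF_CV(), here fit on iris purely for illustration:

CV_result = DF_CV(iris[, 1:4], iris[, 5])
DF_CVsummary(CV_result, plot = TRUE)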
Decision Forest algorithm: feature selection for two-class predictions; keeps statistically significant features that pass the t-test
DF_dataFs(X, Y, p_val = 0.05)
X | Training data set
Y | Training labels
p_val | P-value threshold of the t-test used in feature selection; default is 0.05
Keep_feat: qualified features in data matrix after filtering
## data(iris)
X = iris[iris[, 5] != "setosa", 1:4]
Y = iris[iris[, 5] != "setosa", 5]
used_feat = DF_dataFs(X, Y)
Decision Forest algorithm: data pre-processing; removes all-zero columns/features and highly correlated features
DF_dataPre(X, thres = 0.95)
X | Training data set
thres | Correlation coefficient threshold to filter out highly correlated features; default is 0.95
Keep_feat: qualified features in data matrix after filtering
## data(iris)
X = iris[, 1:4]
Keep_feat = DF_dataPre(X)
This is a script of Decision Forest for easy use.
DF_easy(Train_X, Train_Y, Test_X, Test_Y, mode = "default")
Train_X | Training data set
Train_Y | Training data endpoint
Test_X | Testing data set
Test_Y | Testing data endpoint
mode | Pre-defined modeling mode
data_matrix: training and testing results
# data(demo_simple)
X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
random_seq = sample(nrow(X))
split_rate = 3
split_sample = suppressWarnings(split(random_seq, 1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]], ]
Train_Y = Y[-random_seq[split_sample[[1]]]]
Test_X = X[random_seq[split_sample[[1]]], ]
Test_Y = Y[random_seq[split_sample[[1]]]]
Result = DF_easy(Train_X, Train_Y, Test_X, Test_Y)
Performance evaluation between two factors
DF_perf(pred, label)
pred | Predictions
label | Known endpoint
result$ACC: Prediction accuracy
result$MIS: Misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: Balanced accuracy
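DF_perf is documented with the same arguments and return values as cal_MCC and DF_acc above; a minimal sketch with made-up factors:

pred = factor(c(1, 1, 0, 0, 1), levels = c(0, 1))
label = factor(c(1, 0, 0, 0, 1), levels = c(0, 1))
DF_perf(pred, label)$MCC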
Decision Forest algorithm: model prediction with constructed DF models. DT_models is a list of decision tree models (rpart objects) generated by DF_train(). DF_CV() is only designed for cross-validation and does not generate models.
DF_pred(DT_models, X, Y = NULL)
DT_models | Constructed DF models
X | Test data set
Y | Test data endpoint (optional; default NULL)
.$accuracy: Overall test accuracy
.$predictions: Detailed test prediction
# data(demo_simple)
X = data_dili$X
Y = data_dili$Y
names(Y) = rownames(X)
random_seq = sample(nrow(X))
split_rate = 3
split_sample = suppressWarnings(split(random_seq, 1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]], ]
Train_Y = Y[-random_seq[split_sample[[1]]]]
Test_X = X[random_seq[split_sample[[1]]], ]
Test_Y = Y[random_seq[split_sample[[1]]]]
used_model = DF_train(Train_X, Train_Y)
Pred_result = DF_pred(used_model, Test_X, Test_Y)
Decision Forest algorithm: Model training
DF_train(X, Y, stop_step = 5, Max_tree = 20, min_split = 10, cp = 0.1, Filter = F, p_val = 0.05, Method = "bACC", Quiet = T, Grace_val = 0.05, imp_accu_val = 0.01, imp_accu_criteria = F)
X | Training data set
Y | Training data endpoint
stop_step | Number of extra steps to process after the performance stops improving; 1 means one extra step
Max_tree | Maximum number of trees in the forest
min_split | Minimum number of observations required in a node to attempt a split
cp | Complexity parameter used to prune the decision trees; default is 0.1
Filter | If TRUE, perform feature selection before training
p_val | P-value threshold of the t-test used in feature selection; default is 0.05
Method | Metric used to evaluate the training process: MIS (misclassification rate), ACC (accuracy); default is bACC (balanced accuracy)
Quiet | If TRUE (default), do not show messages during the process
Grace_val | Grace value in evaluation: the next model's performance (accuracy, bACC, MCC) may be no worse than the previous model's by more than this threshold
imp_accu_val | Improvement threshold in evaluation: adding a new tree should improve the overall model performance (accuracy, bACC, MCC) by at least this value
imp_accu_criteria | If TRUE, the model must show improvement in accumulated accuracy
.$accuracy: Overall training accuracy
.$pred: Detailed training predictions (fitting)
.$detail: Detailed usage of decision tree features/models and their performance
.$models: Constructed (list of) decision tree models
.$Method: the evaluation Method used in training (passed through)
.$cp: the cp value used in training the decision trees (passed through)
## data(iris)
X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
used_model = DF_train(X, factor(Y))
Draw plot for Dforest training results
DF_Trainsummary(used_model, plot = T)
used_model | Training result returned by DF_train
plot | If TRUE (default), draw the plot
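A short sketch, assuming used_model is a training result from DF_train() as in the DF_train example above:

used_model = DF_train(iris[, 1:4], iris[, 5])
DF_Trainsummary(used_model, plot = TRUE)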
Demo script to learn the Decision Forest package. Demo data are located in the data/ folder.
Dforest()
Leihong Wu
Dforest()
Multiple plot function
If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), then plot 1 will go in the upper left, 2 will go in the upper right, and 3 will go all the way across the bottom.
multiplot(..., plotlist = NULL, cols = 1, layout = NULL)
... | ggplot objects
plotlist | A list of ggplot objects
cols | Number of columns in layout
layout | A matrix specifying the layout. If present, 'cols' is ignored.
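An illustrative sketch, assuming ggplot2 is installed; p1 and p2 are arbitrary ggplot objects created only for this example:

library(ggplot2)
p1 = ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()
p2 = ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point()
multiplot(p1, p2, cols = 2)
# With an explicit layout: plot 1 on top, plot 2 below
multiplot(plotlist = list(p1, p2), layout = matrix(c(1, 2), nrow = 2, byrow = TRUE))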
Make predictions with a decision tree model
Pred_DT(model, X)
model | Decision tree model
X | Dataset
Decision tree predictions. Different endpoints are presented in multiple columns.
rpart
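A usage sketch (illustrative only), assuming the model was built with Con_DT() on a stand-in data set:

model = Con_DT(iris[, 1:4], iris[, 5])
pred_matrix = Pred_DT(model, iris[, 1:4])   # one column per endpoint/class
head(pred_matrix)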