Title: | Decision Forest |
---|---|
Description: | Provides an R implementation of the Decision Forest algorithm, which combines the predictions of multiple independent decision tree models into a consensus decision. In particular, Decision Forest is a pattern-recognition method that can be used to analyze: (1) DNA microarray data; (2) Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) data; and (3) Structure-Activity Relationship (SAR) data. Three fundamental functions are provided: (1) DF_train, (2) DF_pred, and (3) DF_CV. Run Dforest() to see more instructions. Weida Tong (2003) <doi:10.1021/ci020058s>. |
Authors: | Leihong Wu <[email protected]>, Weida Tong ([email protected]) |
Maintainer: | Leihong Wu <[email protected]> |
License: | GPL-2 |
Version: | 0.4.2 |
Built: | 2025-03-02 03:18:12 UTC |
Source: | https://github.com/cran/Dforest |
Performance evaluation of results from other modeling algorithms
cal_MCC(pred, label)
pred | Predictions
label | Known endpoint
result$ACC: Prediction accuracy
result$MIS: Misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: Balanced accuracy
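A minimal usage sketch (illustrative, not taken from the package documentation); the two factor vectors below are made up:

pred = factor(c("pos", "pos", "neg", "neg", "pos"))    # hypothetical predictions
label = factor(c("pos", "neg", "neg", "neg", "pos"))   # hypothetical known endpoints
result = cal_MCC(pred, label)
result$ACC   # prediction accuracy
result$MCC   # Matthews correlation coefficient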
Construct Decision Tree model with pruning
Con_DT(X, Y, min_split = 10, cp = 0.01)
X | Dataset
Y | Data labels
min_split | Minimum number of observations required in a node to attempt a split
cp | Pre-defined complexity parameter (cp) passed to the rpart program
Decision tree model with pruning, implemented by rpart
rpart
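An illustrative sketch, assuming X is a data matrix and Y the corresponding class labels; iris serves only as a stand-in data set:

X = iris[, 1:4]
Y = iris[, 5]
tree_model = Con_DT(X, Y, min_split = 10, cp = 0.01)
# The returned object is an rpart model, so standard rpart methods apply, e.g.:
print(tree_model)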
This data set gives the DILI endpoint (Most or No DILI-concern) of various compounds, together with QSAR descriptors generated by MOLD2.
data_dili
A list containing two elements: X, a data matrix of 958 observations and 777 variables, and Y, the DILI endpoints of the 958 observations.
In-house data
Minjun Chen (2011). FDA-approved drug labeling for the study of drug-induced liver injury. Drug Discovery Today.
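An illustrative way to inspect the data; the object name data_dili and the data() call follow the DF_pred example later in this manual:

# data(demo_simple)
dim(data_dili$X)     # expected: 958 observations, 777 variables
table(data_dili$Y)   # DILI endpoints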
Performance evaluation from Decision Tree Predictions
DF_acc(pred, label)
pred | Predictions
label | Known endpoint
result$ACC: Prediction accuracy
result$MIS: Misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: Balanced accuracy
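DF_acc is documented with the same arguments and return values as cal_MCC above; a minimal sketch with made-up factor vectors:

pred = factor(c("DILI", "DILI", "noDILI", "noDILI"))
label = factor(c("DILI", "noDILI", "noDILI", "noDILI"))
result = DF_acc(pred, label)
result$ACC    # prediction accuracy
result$bACC   # balanced accuracy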
T-test for feature selection
DF_calp(X, Y)
X | X variable matrix
Y | Y label
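A usage sketch on a two-class subset of iris; the structure of the return value is not documented above, so it is simply captured here (per-feature t-test results are assumed):

X = iris[iris[, 5] != "setosa", 1:4]
Y = droplevels(iris[iris[, 5] != "setosa", 5])
p_res = DF_calp(X, Y)   # assumed: per-feature t-test p-values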
Draw accuracy curve according to the confidence level of predictions
DF_ConfPlot(Pred_result, Label, bin = 20, plot = T, smooth = F)
Pred_result | Predictions
Label | Known labels for the test data set
bin | Number of bins in the confidence plot (default is 20)
plot | If TRUE, draw the plot; otherwise output the data matrix
smooth | If TRUE, fit the performance curve with a smoothing function (via ggplot2)
ACC_Conf: data matrix ("ConfidenceLevel", "Accuracy", "Matched Samples") for the confidence plot (when plot = FALSE)
ConfPlot: confidence plot (when plot = TRUE); requires ggplot2
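A sketch of a full pipeline feeding DF_ConfPlot (requires ggplot2 for the plot); passing the whole DF_pred() output as Pred_result is an assumption based on the matching argument name:

X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
used_model = DF_train(X, factor(Y))
Pred_result = DF_pred(used_model, X, Y)
DF_ConfPlot(Pred_result, Y, bin = 20, plot = TRUE)     # draw the confidence plot
ACC_Conf = DF_ConfPlot(Pred_result, Y, plot = FALSE)   # data matrix instead of a plot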
Draw accuracy curve according to the confidence level of predictions
DF_ConfPlot_accu(Pred_result, Label, bin = 20, plot = T, smooth = F)
Pred_result | Predictions
Label | Known labels for the test data set
bin | Number of bins in the confidence plot (default is 20)
plot | If TRUE, draw the plot; otherwise output the data matrix
smooth | If TRUE, fit the performance curve with a smoothing function (via ggplot2)
ACC_Conf: data matrix ("ConfidenceLevel", "Accuracy", "Matched Samples") for the confidence plot (when plot = FALSE)
ConfPlot: confidence plot (when plot = TRUE); requires ggplot2
Decision Forest algorithm: model training with cross-validation. Default is 5-fold cross-validation.
DF_CV(X, Y, stop_step = 10, CV_fold = 5, Max_tree = 20, min_split = 10, cp = 0.1, Filter = F, p_val = 0.05, Method = "bACC", Quiet = T, Grace_val = 0.05, imp_accu_val = 0.01, imp_accu_criteria = F)
X | Training data set
Y | Training data endpoint
stop_step | Number of extra steps to process after the performance stops improving; 1 means one extra step
CV_fold | Fold of cross-validation (default = 5)
Max_tree | Maximum number of trees in the forest
min_split | Minimum number of observations required in a node to attempt a split
cp | Complexity parameter used to prune the decision trees; default is 0.1
Filter | If TRUE, perform feature selection before training
p_val | P-value threshold of the t-test used in feature selection; default is 0.05
Method | Metric used to evaluate the training process: MIS (misclassification rate), ACC (accuracy); default is bACC (balanced accuracy)
Quiet | If TRUE (default), do not show messages during the process
Grace_val | Grace value in evaluation: the next model's performance (accuracy, bACC, MCC) may be no worse than the previous model's by more than this threshold
imp_accu_val | Improvement threshold in evaluation: adding a new tree should improve the overall model performance (accuracy, bACC, MCC) by at least this value
imp_accu_criteria | If TRUE, the model must show improvement in accumulated accuracy
.$performance: Overall training accuracy (cross-validation)
.$pred: Detailed training predictions (cross-validation)
.$detail: Detailed usage of decision tree features/models and their performance in all CV folds
.$Method: the evaluation Method used in training (passed through)
.$cp: the cp value used in training the decision trees (passed through)
## data(iris)
X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
random_seq = sample(nrow(X))
split_rate = 3
split_sample = suppressWarnings(split(random_seq, 1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]], ]
Train_Y = Y[-random_seq[split_sample[[1]]]]
CV_result = DF_CV(Train_X, Train_Y)
Draw plot for Dforest Cross-validation results
DF_CVsummary(CV_result, plot = T)
CV_result | Cross-validation result returned by DF_CV
plot | If TRUE (default), draw the plot
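A short sketch; CV_result is a cross-validation result produced by DF_CV(), here fit on iris purely for illustration:

CV_result = DF_CV(iris[, 1:4], iris[, 5])
DF_CVsummary(CV_result, plot = TRUE)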
Decision Forest algorithm: feature selection for two-class predictions; keeps statistically significant features that pass the t-test
DF_dataFs(X, Y, p_val = 0.05)
X | Training data set
Y | Training labels
p_val | P-value threshold of the t-test used in feature selection; default is 0.05
Keep_feat: qualified features in data matrix after filtering
## data(iris)
X = iris[iris[, 5] != "setosa", 1:4]
Y = iris[iris[, 5] != "setosa", 5]
used_feat = DF_dataFs(X, Y)
Decision Forest algorithm: data pre-processing; removes all-zero columns/features and highly correlated features
DF_dataPre(X, thres = 0.95)
X | Training data set
thres | Correlation coefficient threshold to filter out highly correlated features; default is 0.95
Keep_feat: qualified features in data matrix after filtering
## data(iris)
X = iris[, 1:4]
Keep_feat = DF_dataPre(X)
This is a script of Decision Forest for easy use.
DF_easy(Train_X, Train_Y, Test_X, Test_Y, mode = "default")
Train_X | Training data set
Train_Y | Training data endpoint
Test_X | Testing data set
Test_Y | Testing data endpoint
mode | Pre-defined modeling mode
data_matrix: training and testing results
# data(demo_simple)
X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
random_seq = sample(nrow(X))
split_rate = 3
split_sample = suppressWarnings(split(random_seq, 1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]], ]
Train_Y = Y[-random_seq[split_sample[[1]]]]
Test_X = X[random_seq[split_sample[[1]]], ]
Test_Y = Y[random_seq[split_sample[[1]]]]
Result = DF_easy(Train_X, Train_Y, Test_X, Test_Y)
Performance evaluation between two factors
DF_perf(pred, label)
pred | Predictions
label | Known endpoint
result$ACC: Prediction accuracy
result$MIS: Misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: Balanced accuracy
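DF_perf is documented with the same arguments and return values as cal_MCC and DF_acc above; a minimal sketch with made-up factors:

pred = factor(c(1, 1, 0, 0, 1), levels = c(0, 1))
label = factor(c(1, 0, 0, 0, 1), levels = c(0, 1))
DF_perf(pred, label)$MCC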
Decision Forest algorithm: model prediction with constructed DF models. DT_models is a list of decision tree models (rpart objects) generated by DF_train(). DF_CV() is only designed for cross-validation and does not generate models.
DF_pred(DT_models, X, Y = NULL)
DT_models | Constructed DF models
X | Test data set
Y | Test data endpoint (optional; default NULL)
.$accuracy: Overall test accuracy
.$predictions: Detailed test prediction
# data(demo_simple)
X = data_dili$X
Y = data_dili$Y
names(Y) = rownames(X)
random_seq = sample(nrow(X))
split_rate = 3
split_sample = suppressWarnings(split(random_seq, 1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]], ]
Train_Y = Y[-random_seq[split_sample[[1]]]]
Test_X = X[random_seq[split_sample[[1]]], ]
Test_Y = Y[random_seq[split_sample[[1]]]]
used_model = DF_train(Train_X, Train_Y)
Pred_result = DF_pred(used_model, Test_X, Test_Y)
Decision Forest algorithm: Model training
DF_train(X, Y, stop_step = 5, Max_tree = 20, min_split = 10, cp = 0.1, Filter = F, p_val = 0.05, Method = "bACC", Quiet = T, Grace_val = 0.05, imp_accu_val = 0.01, imp_accu_criteria = F)
X | Training data set
Y | Training data endpoint
stop_step | Number of extra steps to process after the performance stops improving; 1 means one extra step
Max_tree | Maximum number of trees in the forest
min_split | Minimum number of observations required in a node to attempt a split
cp | Complexity parameter used to prune the decision trees; default is 0.1
Filter | If TRUE, perform feature selection before training
p_val | P-value threshold of the t-test used in feature selection; default is 0.05
Method | Metric used to evaluate the training process: MIS (misclassification rate), ACC (accuracy); default is bACC (balanced accuracy)
Quiet | If TRUE (default), do not show messages during the process
Grace_val | Grace value in evaluation: the next model's performance (accuracy, bACC, MCC) may be no worse than the previous model's by more than this threshold
imp_accu_val | Improvement threshold in evaluation: adding a new tree should improve the overall model performance (accuracy, bACC, MCC) by at least this value
imp_accu_criteria | If TRUE, the model must show improvement in accumulated accuracy
.$accuracy: Overall training accuracy
.$pred: Detailed training predictions (fitting)
.$detail: Detailed usage of decision tree features/models and their performance
.$models: Constructed (list of) decision tree models
.$Method: the evaluation Method used in training (passed through)
.$cp: the cp value used in training the decision trees (passed through)
## data(iris)
X = iris[, 1:4]
Y = iris[, 5]
names(Y) = rownames(X)
used_model = DF_train(X, factor(Y))
Draw plot for Dforest training results
DF_Trainsummary(used_model, plot = T)
used_model | Training result returned by DF_train
plot | If TRUE (default), draw the plot
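A short sketch, assuming used_model is a training result from DF_train() as in the DF_train example above:

used_model = DF_train(iris[, 1:4], iris[, 5])
DF_Trainsummary(used_model, plot = TRUE)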
Demo script to learn the Decision Forest package. Demo data are located in the data/ folder.
Dforest()
Leihong Wu
Dforest()
Multiple plot function
If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), then plot 1 will go in the upper left, 2 will go in the upper right, and 3 will go all the way across the bottom.
multiplot(..., plotlist = NULL, cols = 1, layout = NULL)
... | ggplot objects
plotlist | A list of ggplot objects
cols | Number of columns in layout
layout | A matrix specifying the layout. If present, 'cols' is ignored.
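An illustrative sketch, assuming ggplot2 is installed; p1 and p2 are arbitrary ggplot objects created only for this example:

library(ggplot2)
p1 = ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()
p2 = ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point()
multiplot(p1, p2, cols = 2)
# With an explicit layout: plot 1 on top, plot 2 below
multiplot(plotlist = list(p1, p2), layout = matrix(c(1, 2), nrow = 2, byrow = TRUE))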
Make predictions with a decision tree model
Pred_DT(model, X)
model | Decision tree model
X | Dataset
Decision tree predictions. Different endpoints are presented in multiple columns.
rpart
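A usage sketch (illustrative only), assuming the model was built with Con_DT() on a stand-in data set:

model = Con_DT(iris[, 1:4], iris[, 5])
pred_matrix = Pred_DT(model, iris[, 1:4])   # one column per endpoint/class
head(pred_matrix)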