D2MCS is an object-oriented framework that identifies and exploits the intrinsic characteristics of input data to (i) accurately distribute features into groups (feature clustering) and (ii) design and deploy effective MCS models. The code snippets below cover the four stages of the D2MCS framework: (i) Data Manipulation, (ii) Feature Clustering, (iii) MCS Creation and (iv) Classification Results.
The descriptions and details of all functions in the package can be read through the help(D2MCS) command.
The first step uses the DatasetLoader class to convert the data to be analyzed into a D2MCS-compatible structure called Dataset (or HDDataset when the dataset cannot be stored in memory). The following code fragment instantiates the DatasetLoader class and shows the parameters accepted by its load function.
data.loader <- DatasetLoader$new()
data <- data.loader$load(filepath, header = TRUE, sep = ",",
                         skip.lines = 0, normalize.names = FALSE,
                         ignore.columns = NULL)
Once loading is complete and the data is available in a Dataset object, its methods fall into three main categories according to their behaviour: (i) obtaining dataset information, (ii) removing dataset columns and (iii) splitting the dataset. The following code snippet shows some of the functions in each category.
## DATASET INFORMATION OBTAINER
data$getNcol()
data$getNrow()
data$getColumnNames()
data$getDataset()

## DATASET COLUMN REMOVAL
data$cleanData(columns = NULL,
               remove.funcs = NULL,
               remove.na = FALSE,
               remove.const = FALSE)

## DATASET HANDLING AND SPLITTING
data$createPartitions(num.folds = NULL,
                      percent.folds = NULL,
                      class.balance = NULL)
subset <- data$createSubset(num.folds = NULL,
                            column.id = NULL,
                            opts = list(remove.na = TRUE,
                                        remove.const = FALSE),
                            class.index = NULL,
                            positive.class = NULL)
train <- data$createTrain(num.folds = NULL,
                          class.index,
                          positive.class,
                          opts = list(remove.na = TRUE,
                                      remove.const = FALSE))
The createPartitions() method divides the dataset into partitions, from which the createSubset() and createTrain() methods build the data structures required in the following phases. The createSubset() method creates a Subset object, used both for clustering operations and for validation purposes. The createTrain() method creates a Trainset object, which is needed for the model training stage.
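The idea behind a class-balanced split can be sketched in base R. This is a minimal illustration, not the D2MCS implementation: it assumes that class.balance is meant to preserve the class ratio in every partition, and stratified_folds is a hypothetical helper that does not exist in the package.

```r
# Illustrative only: a base-R stratified split, not D2MCS internals.
# `stratified_folds` is a hypothetical helper, not part of the package.
stratified_folds <- function(labels, num.folds) {
  labels <- as.character(labels)
  folds <- integer(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    # Shuffle the rows of this class, then deal them out round-robin
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(num.folds), length(idx))
  }
  folds
}

set.seed(1)
labels <- factor(rep(c("spam", "ham"), times = c(20, 80)))
folds <- stratified_folds(labels, num.folds = 4)

# Each fold holds 5 "spam" and 20 "ham" rows, matching the 20/80 ratio
fold.table <- table(folds, labels)
print(fold.table)

# Some folds could feed clustering/validation and the rest training
subset.rows <- which(folds %in% 1:2)
train.rows  <- which(folds %in% 3:4)
```

Keeping the two groups of folds disjoint is what allows one part of the data to drive clustering and validation while the other is reserved for training.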
Once a Subset object has been created, stage two begins: distributing the features into clusters. The code snippet below exemplifies the three steps needed to create and execute DependencyBasedStrategy, the clustering strategy included by default in D2MCS.
## FEATURE-CLUSTERING ALGORITHM CREATION
conf <- DependencyBasedStrategyConfiguration$new()
dbs <- DependencyBasedStrategy$new(subset,
                                   heuristic,
                                   configuration = conf)

## FEATURE-CLUSTERING ALGORITHM EXECUTION
dbs$execute(verbose = FALSE)

## FEATURE-CLUSTERING ALGORITHM FUNCTIONALITIES
dbs$getBestClusterDistribution()
dbs.train <- dbs$createTrain(subset,
                             num.clusters = NULL,
                             num.groups = NULL,
                             include.unclustered = FALSE)
Executing the selected clustering strategy yields a Trainset object, which serves as input for the MCS creation phase. This stage is divided into three main steps: (i) D2MCS framework initialization, (ii) MCS behaviour customization options and (iii) execution of the MCS discovery operation.
## D2MCS FRAMEWORK INITIALIZATION
d2mcs <- D2MCS$new(dir.path,
                   num.cores = 2,
                   socket.type = "PSOCK",
                   outfile = NULL)

## MCS BEHAVIOUR CUSTOMIZATION OPTIONS
trFunction <- TwoClass$new(method,
                           number,
                           savePredictions,
                           classProbs,
                           allowParallel,
                           verboseIter,
                           seed = NULL)

## EXECUTION OF MCS DISCOVERY OPERATION
trained <- d2mcs$train(train.set,
                       train.function,
                       num.clusters = NULL,
                       model.recipe = DefaultModelFit$new(),
                       ex.classifiers = c(),
                       ig.classifiers = c(),
                       metrics = NULL,
                       saveAllModels = FALSE)
The code fragment above trains the ML models provided by the caret library in order to find out which of them offer the best performance for the dataset, taking into account the indicated parameters.
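The selection idea can be sketched without D2MCS or caret. The example below is a base-R stand-in built on stated assumptions: the two feature clusters are hand-picked, the two candidate classifiers (logistic regression and a nearest-centroid rule) replace the caret model catalogue, a single hold-out split replaces resampling, and one winner per cluster is kept according to accuracy. None of the helper names belong to the package.

```r
# Illustrative only: base-R stand-in for per-cluster model selection.
# Not D2MCS or caret code; all names below are hypothetical.
set.seed(42)
data <- iris[iris$Species != "setosa", ]
data$Species <- droplevels(data$Species)

# Two hand-picked feature clusters (an assumption for this sketch)
clusters <- list(sepal = c("Sepal.Length", "Sepal.Width"),
                 petal = c("Petal.Length", "Petal.Width"))

# Candidate classifiers: each returns predicted labels for `test`
candidates <- list(
  logistic = function(train, test, feats) {
    fit <- suppressWarnings(
      glm(reformulate(feats, "Species"), data = train, family = binomial))
    prob <- predict(fit, newdata = test, type = "response")
    # glm models the probability of the second factor level
    levels(train$Species)[1 + (prob > 0.5)]
  },
  centroid = function(train, test, feats) {
    # One column of feature means per class
    centers <- sapply(split(train[feats], train$Species), colMeans)
    # Squared Euclidean distance from every test row to every centroid
    dists <- apply(centers, 2, function(ctr)
      rowSums(sweep(as.matrix(test[feats]), 2, ctr)^2))
    colnames(centers)[max.col(-dists, ties.method = "first")]
  }
)

# Single hold-out split: 70 rows for training, 30 for evaluation
train.rows <- sample(nrow(data), 70)
train <- data[train.rows, ]
test  <- data[-train.rows, ]

# Rows are candidate models, columns are feature clusters
accuracy <- sapply(clusters, function(feats)
  sapply(candidates, function(model)
    mean(model(train, test, feats) == test$Species)))

# Keep the best-scoring candidate for each cluster
best <- apply(accuracy, 2, function(acc) names(acc)[which.max(acc)])
print(accuracy)
print(best)
```

Evaluating every candidate on every cluster and keeping one winner per cluster gives the set of models that later take part in the voting.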
After the MCS has been built, the next stage, the classification of the data, begins. To this end, the D2MCS tool needs the TrainOutput structure obtained by training the MCS, the dataset to be predicted and the voting schemes chosen to combine the results of the different MCS models.
## VOTING SCHEMES AVAILABLE IN THE CLASSIFICATION STAGE
voting.types <- c(SingleVoting$new(voting.schemes,
                                   metrics),
                  CombinedVoting$new(voting.schemes,
                                     combined.metrics,
                                     methodology,
                                     metrics))

## EXECUTE THE CLASSIFICATION STAGE
predictions <- d2mcs$classify(train.output,
                              subset,
                              voting.types,
                              positive.class = NULL)

## COMPUTE THE PERFORMANCE OF EACH VOTING SCHEME
predictions$getPerformances(test.set,
                            measures,
                            voting.names = NULL,
                            metric.names = NULL,
                            cutoff.values = NULL)

## OBTAIN THE PREDICTIONS OBTAINED OF EACH VOTING SCHEME USED
predictions$getPredictions(voting.names = NULL,
                           metric.names = NULL,
                           cutoff.values = NULL,
                           type = NULL,
                           target = NULL,
                           filter = FALSE)
When the classification stage is completed, the tool produces a ClassificationOutput object to allow the user to obtain information about the classification performance (getPerformances() method) and to observe in depth the predictions obtained (getPredictions() method).
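To make the roles of voting and performance measurement concrete, the following base-R sketch combines the outputs of three models by simple majority and scores the result. It is an illustration under assumptions, not the D2MCS voting code: the labels are made up for the example, majority_vote is a hypothetical helper, and plain majority is only one possible voting scheme.

```r
# Illustrative only: majority voting over per-model predictions.
# The toy labels below are made up, not D2MCS output.
votes <- data.frame(
  model.a = c("pos", "neg", "pos", "neg", "pos"),
  model.b = c("pos", "pos", "neg", "neg", "pos"),
  model.c = c("neg", "pos", "pos", "neg", "neg"),
  stringsAsFactors = FALSE
)
truth <- c("pos", "pos", "pos", "neg", "neg")

# The most frequent label in a row wins (no ties with 3 voters, 2 classes)
majority_vote <- function(row) names(which.max(table(row)))
final <- unname(apply(votes, 1, majority_vote))

# Confusion matrix with a fixed level order so indexing is stable
lv <- c("pos", "neg")
confusion <- table(predicted = factor(final, lv),
                   actual    = factor(truth, lv))

accuracy  <- sum(diag(confusion)) / sum(confusion)            # 0.8
precision <- confusion["pos", "pos"] / sum(confusion["pos", ]) # 0.75
recall    <- confusion["pos", "pos"] / sum(confusion[, "pos"]) # 1

print(final)
print(confusion)
```

Here the combined vote misclassifies only the last instance, which is why accuracy is 0.8 and precision 0.75 while recall stays at 1.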
A development version of the D2MCS package is also available on its GitHub page: github.com/drordas/D2MCS