IBM 15 Manual

Understanding Data Mining

Classification nodes

The Auto Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or do not churn, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify.
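The ranking step can be pictured with a minimal Python sketch. It is not the node's implementation: the function names, the candidate model names, and the choice of overall accuracy as the ranking criterion are all assumptions made for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of records where the predicted label equals the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def rank_models(actual, predictions_by_model, criterion=accuracy):
    """Score every candidate with the chosen criterion and sort best-first."""
    scores = [
        (name, criterion(actual, predicted))
        for name, predicted in predictions_by_model.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical binary outcomes and the predictions of three candidate models.
actual = ["yes", "no", "yes", "yes", "no"]
candidates = {
    "model_a": ["yes", "no", "no", "yes", "no"],
    "model_b": ["yes", "yes", "yes", "yes", "yes"],
    "model_c": ["yes", "no", "yes", "yes", "no"],
}

ranking = rank_models(actual, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Passing a different function as `criterion` changes the ranking rule without touching the rest of the sketch.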

The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods. The node works in the same manner as the Auto Classifier node, allowing you to choose the algorithms to use and to experiment with multiple combinations of options in a single modeling pass. Supported algorithms include neural networks, C&R Tree, CHAID, linear regression, generalized linear regression, and support vector machines (SVM). Models can be compared based on correlation, relative error, or number of variables used.
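Two of the comparison criteria named above can be sketched in a few lines of Python. The definitions here are assumptions: correlation is taken to be the Pearson correlation between observed and predicted values, and relative error is taken to be the squared prediction error divided by the squared error of always predicting the observed mean. The function names are hypothetical.

```python
import math


def pearson_correlation(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def relative_error(observed, predicted):
    """Squared error of the model relative to a predict-the-mean baseline.

    Assumed definition: values near 0 indicate a good fit, values near 1
    indicate a model no better than the mean.
    """
    mean_o = sum(observed) / len(observed)
    model_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    baseline_error = sum((o - mean_o) ** 2 for o in observed)
    return model_error / baseline_error


# Hypothetical observed values and predictions that are always 0.5 too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(pearson_correlation(observed, predicted))
print(relative_error(observed, predicted))
```

The example shows why the two criteria can disagree: a constant offset leaves the correlation perfect but still produces a nonzero relative error.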

The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered "pure" if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
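A single impurity-minimizing binary split can be sketched as follows. The sketch assumes Gini impurity as the impurity measure (the text above does not say which measure is used), a single numeric input field, and hypothetical function and field names; it shows one partitioning step, not the full recursive algorithm.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Find the threshold t that minimizes weighted impurity of
    the two subgroups (value <= t) and (value > t).

    Returns (threshold, weighted_impurity); the threshold is None when
    no split improves on the unsplit node.
    """
    n = len(values)
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical training records: customer age and a churn flag.
ages = [22, 25, 30, 41, 47, 52]
churn = ["yes", "yes", "yes", "no", "no", "no"]
print(best_binary_split(ages, churn))  # (30, 0.0): both subgroups are pure
```

A tree builder would apply the same search again inside each subgroup until the nodes are pure or a stopping rule is met.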

The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. Input fields can be numeric ranges (continuous), but the target field must be categorical. All splits are binary.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute.
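The chi-square statistic that drives split selection can be illustrated with a short sketch. It computes the Pearson chi-square statistic for a contingency table of input category by target category. The function name and the counts are hypothetical, every row and column is assumed to have a nonzero total, and comparing raw statistics is a simplification of how candidate splits would be evaluated in practice.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    table[i][j] is the number of records in input category i
    and target category j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if input and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts of (churn, no churn) per category of two input fields.
contract_type = [[30, 10], [10, 30]]            # strongly related to churn
region = [[20, 20], [20, 20], [20, 20]]         # three branches, unrelated

print(chi_square(contract_type))  # 20.0
print(chi_square(region))         # 0.0
```

The `region` table has three rows, which corresponds to a split with more than two branches.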

The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
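Choosing the field with the maximum information gain can be sketched as below. This is a textbook entropy-based calculation, not the C5.0 implementation; the function names, field names, and records are assumptions. Each distinct value of a field forms its own subgroup, so a field with three values yields a three-way split.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of target categories."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def information_gain(field_values, labels):
    """Reduction in entropy from splitting on each distinct field value."""
    groups = defaultdict(list)
    for value, label in zip(field_values, labels):
        groups[value].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical records with two candidate input fields and a categorical target.
target = ["yes", "yes", "no", "no"]
fields = {
    "plan": ["basic", "basic", "premium", "premium"],   # separates the target
    "region": ["north", "south", "north", "south"],     # tells us nothing
}

gains = {name: information_gain(values, target) for name, values in fields.items()}
best_field = max(gains, key=gains.get)
print(gains, best_field)
```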

The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome.
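The first-match behavior of an ordered rule list can be sketched as follows. The rule conditions, field names, and outcome labels are hypothetical, and the `default` outcome for records that match no rule is an assumption added so the function always returns a value.

```python
def apply_decision_list(rules, record, default):
    """Return the outcome of the first rule whose condition matches the record."""
    for condition, outcome in rules:
        if condition(record):
            return outcome
    return default  # assumed fallback when no rule matches


# Hypothetical rules, each a (condition, outcome) pair, listed in priority order.
rules = [
    (lambda r: r["complaints"] > 2, "likely to churn"),
    (lambda r: r["tenure_months"] >= 24, "unlikely to churn"),
]

# This record satisfies both conditions; the earlier rule decides the outcome.
long_term_but_unhappy = {"tenure_months": 30, "complaints": 5}
print(apply_decision_list(rules, long_term_but_unhappy, "no segment"))
```

Reversing the order of the two rules would change the outcome for this record, which is why rule order is part of the model.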

Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
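For the single-predictor case, a linear relationship can be fitted with a few lines of Python. The sketch assumes ordinary least squares as the fitting method and uses hypothetical function names and data; it covers only one predictor.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predicted target value for a new predictor value."""
    return intercept + slope * x_new


# Hypothetical data that lie exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(x, y)
print(intercept, slope, predict(intercept, slope, 5.0))
```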