IBM 15 Switch User Manual


 
Chapter 6
Handling Missing Values
Overview of Missing Values
During the Data Preparation phase of data mining, you will often want to replace missing values
in the data. Missing values are values in the data set that are unknown, uncollected, or incorrectly
entered. Usually, such values are invalid for their fields. For example, the field Sex should contain
the values M and F. If you discover the values Y or Z in the field, you can safely assume that such
values are invalid and should therefore be interpreted as blanks. Likewise, a negative value for the
field Age is meaningless and should also be interpreted as a blank. Frequently, such obviously
wrong values are purposely entered, or fields left blank, during a questionnaire to indicate a
nonresponse. At times, you may want to examine these blanks more closely to determine whether
a nonresponse, such as the refusal to give one’s age, is a factor in predicting a specific outcome.
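The validity checks described above can be sketched outside the product. The following Python sketch is illustrative only and does not use any SPSS Modeler API; the function name clean_record and the BLANK sentinel are hypothetical names chosen for this example. It assumes each record is a dictionary with the fields Sex and Age, and it treats any Sex value other than M or F, and any negative or non-numeric Age, as blank.

```python
# Illustrative sketch only; not SPSS Modeler code. All names are hypothetical.
BLANK = None  # stand-in for a value that should be interpreted as blank

VALID_SEX = {"M", "F"}  # the only values the Sex field should contain


def clean_record(record):
    """Return a copy of record with invalid Sex and Age values replaced by BLANK."""
    cleaned = dict(record)

    # Any value other than M or F (for example Y or Z) is invalid for Sex.
    if cleaned.get("Sex") not in VALID_SEX:
        cleaned["Sex"] = BLANK

    # A negative or non-numeric Age is meaningless and is treated as blank.
    age = cleaned.get("Age")
    if not isinstance(age, (int, float)) or age < 0:
        cleaned["Age"] = BLANK

    return cleaned


records = [
    {"Sex": "M", "Age": 34},
    {"Sex": "Y", "Age": 51},   # invalid Sex
    {"Sex": "F", "Age": -1},   # invalid Age
]
cleaned = [clean_record(r) for r in records]
```

Keeping the blanks as an explicit marker, rather than dropping the records, leaves room to examine later whether a nonresponse is itself related to an outcome.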
Some modeling techniques handle missing data better than others. For example, C5.0 and
Apriori cope well with values that are explicitly declared as “missing” in a Type node. Other
modeling techniques have trouble dealing with missing values and experience longer training
times, resulting in less-accurate models.
There are several types of missing values recognized by IBM® SPSS® Modeler:
Null or system-missing values.
These are nonstring values that have been left blank in the
database or source file and have not been specifically defined as “missing” in a source or
Type node. System-missing values are displayed as $null$. Note that empty strings are not
considered nulls in SPSS Modeler, although they may be treated as nulls by certain databases.
Empty strings and white space.
Empty string values and white space (strings with no visible
characters) are treated as distinct from null values. Empty strings are treated as equivalent to
white space for most purposes. For example, if you select the option to treat white space as
blanks in a source or Type node, this setting applies to empty strings as well.
Blank or user-defined missing values.
These are values such as unknown, 99, or –1 that are
explicitly defined in a source node or Type node as missing. Optionally, you can also choose
to treat nulls and white space as blanks, which allows them to be flagged for special treatment
and to be excluded from most calculations. For example, you can use the @BLANK function to
treat these values, along with other types of missing values, as blanks.
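The distinction between the three types of missing values can be illustrated with a short Python sketch. This is not SPSS Modeler code and it is not an implementation of the @BLANK function; the names classify, is_blank, and USER_BLANKS are hypothetical, Python's None and NaN stand in for $null$, and the user-defined blank values are taken from the examples in the text.

```python
import math

# Hypothetical user-defined blank values, mirroring the examples in the text.
USER_BLANKS = {"unknown", 99, -1}


def classify(value, user_blanks=USER_BLANKS):
    """Classify a value as 'null', 'empty_or_whitespace', 'user_blank', or 'valid'."""
    # None and NaN stand in for null (system-missing) values in this sketch.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "null"
    # Empty strings and white space are grouped together, but kept distinct from nulls.
    if isinstance(value, str) and value.strip() == "":
        return "empty_or_whitespace"
    # Values explicitly declared as missing by the user.
    if value in user_blanks:
        return "user_blank"
    return "valid"


def is_blank(value, nulls_as_blank=True, whitespace_as_blank=True):
    """Return True if value counts as blank under the chosen options."""
    kind = classify(value)
    if kind == "user_blank":
        return True
    if kind == "null":
        return nulls_as_blank
    if kind == "empty_or_whitespace":
        return whitespace_as_blank
    return False
```

In this sketch, the two optional arguments of is_blank play the role of the choice to treat nulls and white space as blanks: user-defined values always count as blank, while nulls and white space count only when the corresponding option is on.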
© Copyright IBM Corporation 1994, 2012.