IBM 15 Switch User Manual


 
Handling Missing Values
In general terms, there are two approaches you can follow:
You can exclude fields or records with missing values.
You can impute, replace, or coerce missing values using a variety of methods.
Both of these approaches can be largely automated using the Data Audit node. For example, you
can generate a Filter node that excludes fields with too many missing values to be useful in
modeling, and generate a Supernode that imputes missing values for any or all of the fields that
remain. This is where the real power of the audit comes in, allowing you not only to assess the
current state of your data, but to take action based on the assessment.
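The two-step idea described above (drop fields that are too sparse, then impute what remains) can be illustrated outside the product with a minimal plain-Python sketch. This is not how the Data Audit node is implemented. The function names, the 50% threshold, the sample data, and the choice of mean imputation are all assumptions made for illustration.

```python
from statistics import mean

def missing_fraction(records, field):
    """Fraction of records in which `field` is missing (None)."""
    return sum(1 for r in records if r.get(field) is None) / len(records)

def filter_and_impute(records, fields, max_missing=0.5):
    """Drop fields whose missing fraction exceeds max_missing, then fill
    the remaining gaps in each kept field with that field's mean.
    Assumes every kept field is numeric (hypothetical helper)."""
    kept = [f for f in fields if missing_fraction(records, f) <= max_missing]
    means = {
        f: mean(r[f] for r in records if r.get(f) is not None)
        for f in kept
    }
    return [
        {f: (r.get(f) if r.get(f) is not None else means[f]) for f in kept}
        for r in records
    ]

# Hypothetical data: "income" is mostly missing, "age" has one gap.
data = [
    {"age": 30, "income": None},
    {"age": None, "income": None},
    {"age": 50, "income": 40000},
    {"age": 40, "income": None},
]
cleaned = filter_and_impute(data, ["age", "income"])
```

In this sketch the income field is dropped because three of its four values are missing, and the single gap in age is filled with the mean of the observed ages.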
Handling Records with Missing Values
If the majority of missing values is concentrated in a small number of records, you can just
exclude those records. For example, a bank usually keeps detailed and complete records on
its loan customers. If, however, the bank is less restrictive in approving loans for its own staff
members, data gathered for staff loans is likely to have several blank fields. In such a case, there
are two options for handling these missing values:
You can use a Select node to remove the staff records.
If the data set is large, you can discard all records with blanks.
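The two record-level options above can be sketched in plain Python as follows. This is only an illustration of the logic, not the Select node itself. The record layout, the is_staff flag, and the helper names are hypothetical.

```python
def drop_where(records, predicate):
    """Return the records for which predicate(record) is false."""
    return [r for r in records if not predicate(r)]

def has_blank(record):
    """True if any field in the record is missing (None or empty string)."""
    return any(v is None or v == "" for v in record.values())

# Hypothetical loan records; is_staff marks loans made to staff members.
loans = [
    {"id": 1, "is_staff": False, "income": 52000, "term": 36},
    {"id": 2, "is_staff": True, "income": None, "term": 24},
    {"id": 3, "is_staff": False, "income": 61000, "term": ""},
    {"id": 4, "is_staff": True, "income": 48000, "term": 12},
]

# Option 1: remove the staff records, whatever their contents.
non_staff = drop_where(loans, lambda r: r["is_staff"])

# Option 2: discard every record that contains a blank.
complete = drop_where(loans, has_blank)
```

With this sample data, option 1 keeps a non-staff record that still contains a blank, while option 2 keeps a complete staff record. Option 2 removes every blank but costs more records, which is why it suits large data sets.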
Handling Fields with Missing Values
If the majority of missing values is concentrated in a small number of fields, you can address them
at the field level rather than at the record level. This approach also allows you to experiment with
the relative importance of particular fields before deciding on an approach for handling missing
values. If a field is unimportant in modeling, it probably is not worth keeping, regardless of how
many missing values it has.
For example, a market research company may collect data from a general questionnaire
containing 50 questions. Two of the questions address age and political persuasion, information
that many people are reluctant to give. In this case, Age and Political_persuasion have many
missing values.
Field Measurement Level
In determining which method to use, you should also consider the measurement level of fields
with missing values.
Numeric fields. For numeric field types, such as Continuous, you should always eliminate any
non-numeric values before building a model, because many models will not function if blanks are
included in numeric fields.
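As a rough illustration of eliminating non-numeric values, the plain-Python sketch below converts each raw value to a number and marks anything unusable as missing (None). Those missing values can then be excluded or imputed before modeling. The helper name to_number and the sample values are hypothetical, and this is not the product's own coercion logic.

```python
import math

def to_number(value):
    """Return value as a float, or None if it is not a usable number."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # Treat NaN and infinities as missing too.
    return None if math.isnan(number) or math.isinf(number) else number

# Hypothetical raw values for a numeric Age field.
raw_ages = [34, "41", "", None, "n/a", "27.5"]
clean_ages = [to_number(v) for v in raw_ages]
```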
Categorical fields. For categorical fields, such as Nominal and Flag, altering missing values is not
necessary but will increase the accuracy of the model. For example, a model that uses the field Sex
will still function with meaningless values, such as Y and Z, but removing all values other than M
and F will increase the accuracy of the model.
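The Sex example can be sketched in plain Python by keeping only the values the text names as meaningful (M and F) and marking everything else as missing. The helper name clean_category and the sample list are hypothetical, used only to illustrate the idea.

```python
VALID_SEX = {"M", "F"}  # the only meaningful values named in the text

def clean_category(value, valid):
    """Return value if it is in the valid set, otherwise None."""
    return value if value in valid else None

# Hypothetical raw values for the Sex field, including Y and Z.
raw = ["M", "F", "Y", "Z", "F", None]
clean = [clean_category(v, VALID_SEX) for v in raw]
```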