IBM 15 Switch User Manual


 
Understanding Data Mining
together. It may not even be online. If it exists only on paper, data entry will be required before
you can begin data mining.
Check whether the data covers the relevant attributes
The object of data mining is to identify relevant attributes, so including this check may seem odd
at first. It is very useful, however, to look at what data is available and to try to identify the likely
relevant factors that are not recorded. In trying to predict ice cream sales, for example, you may
have a lot of information about retail outlets or sales history, but you may not have weather
and temperature information, which is likely to play a significant role. Missing attributes do
not necessarily mean that data mining will not produce useful results, but they can limit the
accuracy of resulting predictions.
A quick way of assessing the situation is to perform a comprehensive audit of your data.
Before moving on, consider attaching a Data Audit node to your data source and running it to
generate a full report.
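
The kind of summary such an audit produces can be sketched outside SPSS Modeler. The following is a minimal illustration in plain Python, not the Data Audit node itself; the function name `audit`, the field names, and the sample records are all hypothetical. It reports, for each attribute you expect to need, how many values are missing, how many distinct values occur, and the numeric range.

```python
def audit(records, expected_fields):
    """Summarize each expected field: missing count, distinct values, numeric range."""
    report = {}
    for field in expected_fields:
        values = [r.get(field) for r in records]
        # Treat None and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        report[field] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report

# Hypothetical ice cream sales records; temperature was never recorded.
sales = [
    {"outlet": "A", "units_sold": 120},
    {"outlet": "B", "units_sold": 95},
    {"outlet": "A", "units_sold": None},
]
report = audit(sales, ["outlet", "units_sold", "temperature"])
for field, stats in report.items():
    print(field, stats)
```

An attribute that is missing in every record, such as `temperature` here, is a likely relevant factor that the data does not cover.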
Beware of noisy data
Data often contains errors or may contain subjective, and therefore variable, judgments. These
phenomena are collectively referred to as noise. Sometimes noise in data is normal. There may
well be underlying rules, but they may not hold for 100% of the cases.
Typically, the more noise there is in data, the more difficult it is to get accurate results.
However, SPSS Modeler’s machine-learning methods are able to handle noisy data and have been
used successfully on data sets containing almost 50% noise.
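
One rough way to see how noisy a data set is with respect to a suspected rule is to measure how often the rule actually holds. The sketch below is an illustration only, not an SPSS Modeler feature; the helper `rule_hold_rate`, the rule itself, and the records are hypothetical.

```python
def rule_hold_rate(records, condition, conclusion):
    """Fraction of records matching `condition` for which `conclusion` also holds."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None  # the rule never applies, so no rate can be computed
    return sum(1 for r in matching if conclusion(r)) / len(matching)

# Hypothetical daily records.
days = [
    {"temp_c": 31, "sales": "high"},
    {"temp_c": 29, "sales": "high"},
    {"temp_c": 30, "sales": "low"},   # noisy case: hot day, low sales
    {"temp_c": 12, "sales": "low"},
    {"temp_c": 28, "sales": "high"},
]

# Suspected rule: warm days (25 degrees or more) have high sales.
rate = rule_hold_rate(
    days,
    condition=lambda r: r["temp_c"] >= 25,
    conclusion=lambda r: r["sales"] == "high",
)
print(f"Rule holds for {rate:.0%} of matching records")
```

A rate below 100% does not invalidate the rule; it indicates how much noise a model built on it must tolerate.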
Ensure that there is sufficient data
In data mining, it is not necessarily the size of a data set that is important. The representativeness
of the data set is far more significant, together with its coverage of possible outcomes and
combinations of variables.
Typically, the more attributes that are considered, the more records that will be needed to
give representative coverage.
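
The effect of adding attributes on coverage can be illustrated with a small sketch. It is a simplified assumption-laden example, not a Modeler function: the helper `combination_coverage`, the field names, and the records are hypothetical, and the set of possible values for each field is taken to be only the values actually seen in the data.

```python
from itertools import product

def combination_coverage(records, fields):
    """Return (fraction of value combinations observed, list of unobserved combinations).

    The possible values of each field are assumed to be those seen in `records`.
    """
    domains = [sorted({r[f] for r in records}) for f in fields]
    possible = set(product(*domains))
    observed = {tuple(r[f] for f in fields) for r in records}
    return len(observed) / len(possible), sorted(possible - observed)

# Hypothetical records with three categorical attributes.
records = [
    {"region": "north", "season": "summer", "promo": "yes"},
    {"region": "north", "season": "winter", "promo": "no"},
    {"region": "south", "season": "summer", "promo": "no"},
    {"region": "south", "season": "winter", "promo": "no"},
]

cov2, _ = combination_coverage(records, ["region", "season"])
cov3, missing3 = combination_coverage(records, ["region", "season", "promo"])
print(f"2 attributes: {cov2:.0%} of combinations covered")
print(f"3 attributes: {cov3:.0%} of combinations covered")
print("Unobserved:", missing3)
```

The same four records cover every combination of two attributes but only half of the combinations once a third attribute is considered, which is why more attributes generally call for more records.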
If the data is representative and there are general underlying rules, it may well be that a data
sample of a few thousand (or even a few hundred) records will give results as good as those from a
million records, and you will get the results more quickly.
Seek out the experts on the data
In many cases, you will be working on your own data and will therefore be highly familiar with
its content and meaning. However, if you are working on data for another department of your
organization or for a client, it is highly desirable that you have access to experts who know the
data. They can guide you in the identification of relevant attributes and can help to interpret the
results of data mining, distinguishing the true nuggets of information from “fool’s gold,” or
artifacts caused by anomalies in the data sets.