IBM 15 Switch User Manual

Performance Considerations for Streams and Nodes
Note: The following databases support temporary tables for the purpose of caching: DB2, Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database caching. The SQL code can be customized for specific databases - contact Support for assistance.
Performance: Process Nodes
Sort.
The Sort node must read the entire input data set before it can be sorted. The data is stored in memory up to some limit, and the excess is spilled to disk. The sorting algorithm is a combination algorithm: data is read into memory up to the limit and sorted using a fast hybrid quick-sort algorithm. If all the data fits in memory, then the sort is complete. Otherwise, a merge-sort algorithm is applied. The sorted data is written to file and the next chunk of data is read into memory, sorted, and written to disk. This is repeated until all the data has been read; then the sorted chunks are merged. Merging may require repeated passes over the data stored on disk. At peak usage, the Sort node will have two complete copies of the data set on disk: sorted and unsorted.
The overall running time of the algorithm is on the order of N*log(N), where N is the number of records. Sorting in memory is faster than merging from disk, so the actual running time can be reduced by allocating more memory to the sort. The algorithm allocates to itself a fraction of physical RAM controlled by the IBM® SPSS® Modeler Server configuration option Memory usage multiplier. To increase the memory used for sorting, provide more physical RAM or increase this value. Note that when the proportion of memory used exceeds the working set of the process, so that part of the memory is paged to disk, performance degrades because the memory-access pattern of the in-memory sort algorithm is random and can cause excessive paging. The sort algorithm is used by several nodes other than the Sort node, but the same performance rules apply.
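The sort-then-spill-then-merge behavior described above can be sketched in Python. This is an illustrative simplification, not Modeler's implementation; `memory_limit` stands in for the RAM budget that the Memory usage multiplier option controls.

```python
import heapq
import tempfile

def external_sort(records, memory_limit):
    """Sketch of a combined in-memory / merge sort: read records into
    memory up to `memory_limit`, sort each chunk, spill sorted runs to
    disk, then merge the runs."""
    runs = []
    chunk = []

    def spill(chunk):
        # Write one sorted run to a temporary file on disk.
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{r}\n" for r in chunk)
        f.seek(0)
        runs.append(f)

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= memory_limit:
            chunk.sort()          # fast in-memory sort of this chunk
            spill(chunk)
            chunk = []
    chunk.sort()
    if not runs:                  # everything fit in memory: done
        return chunk
    spill(chunk)
    # Merge the sorted runs; heapq.merge streams them in sorted order.
    merged = heapq.merge(*[(line.rstrip("\n") for line in run) for run in runs])
    return list(merged)
```

Note that at the moment the merge begins, the spilled runs (one full copy of the data) and the merged output (a second copy) can coexist on disk, which matches the peak-usage remark above.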
Binning.
The Binning node reads the entire input data set to compute the bin boundaries before it allocates records to bins. The data set is cached while the boundaries are computed; then it is rescanned for allocation. When the binning method is fixed-width or mean+standard deviation, the data set is cached directly to disk. These methods have a linear running time and require enough disk space to store the entire data set. When the binning method is ranks or tiles, the data set is sorted using the sort algorithm described earlier, and the sorted data set is used as the cache. Sorting gives these methods a running time of M*N*log(N), where M is the number of binned fields and N is the number of records; it requires disk space equal to twice the data set size.
Generating a Derive node based on the generated bins will improve performance in subsequent passes. Derive operations are much faster than binning.
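The two-pass, linear-time character of fixed-width binning can be seen in a small sketch (an assumed simplification, not the node's actual code): one pass over the cached data to find the range and derive the boundaries, a second pass to allocate each record to a bin.

```python
def fixed_width_bins(values, num_bins):
    """Sketch of fixed-width binning: pass 1 computes the boundaries,
    pass 2 allocates records. Both passes are linear in the number of
    records, matching the running time described above."""
    lo, hi = min(values), max(values)      # pass 1: compute boundaries
    width = (hi - lo) / num_bins or 1      # guard against a zero range

    def bin_of(v):
        b = int((v - lo) / width)
        return min(b, num_bins - 1)        # the maximum value lands in the last bin

    return [bin_of(v) for v in values]     # pass 2: allocation
```

A generated Derive node effectively freezes the boundaries from pass 1, so later runs only need the cheap pass-2 allocation.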
Merge by Key (Join).
The Merge node, when the merge method is keys (equivalent to a database join), sorts each of its input data sets by the key fields. This part of the procedure has a running time of M*N*log(N), where M is the number of inputs and N is the number of records in the largest input; it requires sufficient disk space to store all of its input data sets plus a second copy of the largest data set. The running time of the merge itself is proportional to the size of the output data set, which depends on the frequency of matching keys. In the worst case, where the output is the Cartesian product of the inputs, the running time may approach N^M. This is rare; most joins have many fewer matching keys. If one data set is relatively larger than the other(s), or if the incoming data is already sorted by a key field, then you can improve the performance of this node using the Optimization tab.
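The sort-then-scan procedure described above is a classic sort-merge join. A minimal two-input sketch (illustrative only; the function name and record layout are not Modeler's API) shows why the merge phase costs time proportional to the output size:

```python
def merge_join(left, right, key):
    """Sketch of a sort-merge join on one key field: sort both inputs
    by the key, then scan them in step, emitting one output record per
    matching pair."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i = [], 0
    for l in left:
        # Advance past right-hand records with smaller keys.
        while i < len(right) and right[i][key] < l[key]:
            i += 1
        # Emit one record per matching pair; heavily duplicated keys
        # make this inner loop (and the output) large, up to the
        # Cartesian product of the inputs in the worst case.
        j = i
        while j < len(right) and right[j][key] == l[key]:
            out.append({**l, **right[j]})
            j += 1
    return out
```

If one input were already sorted, its sort step could be skipped entirely, which is the kind of saving the Optimization tab settings are described as enabling.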