 Inductive Solutions, Inc.
380 Rector Place, Suite 4A, New York, New York 10280

Telephone: +1 (212)945.0630

## RunPCA

RunPCA is an information discovery ("data mining") tool based on Principal Component Analysis, a statistical method that transforms a set of data inputs into a new, smaller set of uncorrelated inputs ordered by information content. RunPCA requires a 64-bit implementation of Windows. It is built on very fast C/C++ code, and the size of the data sets it can process is limited by the available dynamic memory.

RunPCA Features

• Computes means, variances, covariances, and correlations of large data sets
• Computes and ranks principal components and their variances
• Automatically transforms data sets
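
The computations listed above can be sketched in a few lines of Python with NumPy. This is a minimal illustration of standard Principal Component Analysis, not the RunPCA library or its interface; the function name `pca_summary` and the order of its return values are assumptions made for the example.

```python
import numpy as np

def pca_summary(data):
    """Column statistics and ranked principal components of a 2-D array.

    Hypothetical helper for illustration only; it is not RunPCA's API.
    Rows are observations, columns are factors.
    """
    data = np.asarray(data, dtype=float)
    means = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)        # sample covariance (divides by n - 1)
    variances = np.diag(cov)                # per-column variances
    corr = np.corrcoef(data, rowvar=False)  # correlation matrix

    # eigh handles symmetric matrices and returns eigenvalues in ascending
    # order, so reorder to rank the components from highest variance down.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Project the centered data onto the principal directions.
    transformed = (data - means) @ eigvecs
    return means, variances, cov, corr, eigvals, eigvecs, transformed
```

In this sketch the returned `eigvals` are the variances of the principal components, and the columns of `transformed` are uncorrelated with one another.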

Benefits

• Easy-to-Learn and Easy-to-Use Excel Spreadsheet User Interface
• Computation is very fast
• The RunPCA C/C++ Library is available for further customization

For example, suppose we have a table of 1000 rows and 3 columns (or "factors") and we want to discover some sort of relationship between the columns.  The following table shows how the variance of the data of each column is distributed:

| Variance | Fraction (%) | Accumulated (%) |
|---|---|---|
| 0.381745 | 66.57 | 66.57 |
| 0.095436 | 16.64 | 83.21 |
| 0.096271 | 16.79 | 100 |

Most of the information (the highest variance) is contained in the first column: almost two-thirds, as the first row of the table shows.  The remaining information is split almost evenly between the other two columns, as the next two rows show.
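
The Fraction and Accumulated columns follow directly from the Variance column: each fraction is a variance divided by the total variance, expressed as a percentage, and the accumulated value is the running sum of the fractions. A minimal Python sketch of that arithmetic follows; the helper name `variance_table` is hypothetical and is not part of RunPCA.

```python
from itertools import accumulate

def variance_table(variances):
    """Return (variance, fraction %, accumulated %) rows for a list of variances.

    Hypothetical helper for illustration only.
    """
    total = sum(variances)
    fractions = [100.0 * v / total for v in variances]
    accumulated = list(accumulate(fractions))  # running sum of the fractions
    return list(zip(variances, fractions, accumulated))

# Variances of the three original columns, taken from the table above.
for variance, fraction, running in variance_table([0.381745, 0.095436, 0.096271]):
    print(f"{variance:.6f} {fraction:6.2f} {running:6.2f}")
```

The same helper can be applied to the variances of the transformed data to see how the information is redistributed.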

After processing by RunPCA, the three original columns are transformed to "principal factors."  Now the variance of the transformed data (consisting of the three principal factors) is distributed as follows:

| Variance | Fraction (%) | Accumulated (%) |
|---|---|---|
| 0.494189 | 86.18 | 86.18 |
| 0.079263 | 13.82 | 100 |
| 0 | 0 | 100 |

Now most of the information (the highest variance) is contained in the first principal factor (86% of the information). The remaining information is contained entirely in the second principal factor; the third has zero variance and carries no information.  This effectively reduces the dimension of the data by one-third, from three columns to two.  This means that if we have additional data of observed responses (or target outputs), then we can perform regression (or train a neural network) using only two columns of the transformed data, rather than the three columns of the original data. This can improve the speed and accuracy of the training or regression.
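
The workflow described above can be sketched in Python with NumPy. The example uses synthetic data in which the third column is a combination of the first two, so it only mirrors the situation in the tables and does not reproduce their numbers. The function `reduce_by_pca`, its `keep_fraction` parameter, and the ordinary least-squares fit are assumptions chosen for illustration, not RunPCA's own method or interface.

```python
import numpy as np

def reduce_by_pca(data, keep_fraction=0.9999):
    """Project centered data onto the fewest principal factors whose
    accumulated variance fraction reaches keep_fraction (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]               # rank by variance, highest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.searchsorted(cumulative, keep_fraction) + 1)
    return centered @ eigvecs[:, :n_keep]

# Synthetic table: 1000 rows, 3 columns, third column redundant.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
data = np.column_stack([x[:, 0], x[:, 1], x[:, 0] - x[:, 1]])
target = 2.0 * data[:, 0] - data[:, 2] + rng.normal(scale=0.01, size=1000)

# Regress the target on the retained principal factors plus an intercept.
factors = reduce_by_pca(data)
design = np.column_stack([factors, np.ones(len(factors))])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
residual = target - design @ coef
print(factors.shape[1], "factors kept; residual std:", residual.std())
```

Because the dropped factor has essentially zero variance, the fit on the reduced columns loses no information relative to a fit on all three original columns.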