In the last post, Principle Component Analysis in SQL with Google Bigquery. Part 1 I used SQL to find the eigenvalues. Now I will use the temp table I made there and demonstrate how to apply the transformation back to the original data. If you didn’t read that post or have your own eigenvalues and different data, then you should be able to apply this next part to your own data without reading the post. My data is stored in a temp table called
projectid.datasetid.zzz_example_table and my eignvalues are saved in a different table called
projectid.datasetid.zzz_example_eigenvalues both of which are just generic placeholders. I am not going into any detail on why/how, as the math behind it can be derived from this excellent paper by Lindsay I Smith. http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf
Step 1: Finding the resulting eigenvectors and the length of each vector.
In the query below, I am taking the two eigenvectors found in the previous post, the covariance matrix, and then subtracting the eigenvectors from the diagonal, setting x2 equal to one, and then solving for x1 for each eigenvalue. Then I am calculating the length for each vector.
CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_evectors` as ( with eval1 as ( Select max(EigenValue) as EigenValue1 from `projectid.datasetid.zzz_example_eigenvalues` where EigenValueID = 1 ) ,eval2 as ( Select min(EigenValue) as EigenValue2 from `projectid.datasetid.zzz_example_eigenvalues` where EigenValueID = 2 ) ,CovarMatrix as ( Select --this should be a saved table since we're using it twice, but whatever round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11 ,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12 ,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21 ,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22 from `projectid.datasetid.zzz_example_table` where father_age <> 99 ) Select a.evector_11 ,a.evector_12 ,SQRT(POW(evector_11,2)+POW(evector_12,2)) as Vector1_Length ,a.evector_21 ,a.evector_22 ,SQRT(POW(evector_21,2)+POW(evector_22,2)) as Vector2_Length from ( Select -1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue1) as evector_11 ,1.00 as evector_12 --we are forcing this value to one and solving for the above using the first eigenvector ,-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue2) as evector_21 ,1.00 as evector_22 --we are forcing this value to one and solving for the above using the seond eigenvector from CovarMatrix,eval1,eval2 ) a );
Next I will divide the eigenvectors by the vector length to re-scale the data and get the final matrix, that we will then use to transform the original data using matrix multiplication. Again I will reference the paper above as invaluable to knowing this step. These lengths are Vector1_Length and Vector2_Length in the SQL above. Combing these in the query below will produce a decomposition matrix similar to what I would get out of KNIME.
Select * from ( Select 1 as RowID ,evector_11/Vector1_Length as Col_1 ,evector_12/Vector1_Length as Col_2 from `projectid.datasetid.zzz_example_table` union all Select 2 as RowID ,evector_21/Vector2_Length as Col_1 ,evector_22/Vector2_Length as Col_2 from projectid.datasetid.zzz_example_table` ) order by 1 asc
Output from the KNIME PCA Compute module:
Now I will take the original table data, find the means for mother_age and father_age in the tbl_avg CTE, then I will take the row’s value and subtract the mean_adjusted age from the rows.
CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_final_transform` as ( with tbl_avg as ( Select round(AVG(mother_age),6) as avg_mother_age ,round(AVG(father_age),6) as avg_father_age from `projectid.datasetid.zzz_example_table` ) , matrix as ( Select evector_11/Vector1_Length as Decomp_11 ,evector_12/Vector1_Length as Decomp_12 ,evector_21/Vector2_Length as Decomp_21 ,evector_22/Vector2_Length as Decomp_22 from `projectid.datasetid.zzz_example_evectors` ) Select * ,Decomp_11*mean_adjusted_mother_age+Decomp_12*mean_adjusted_father_age as PCA_Dim1 ,Decomp_21*mean_adjusted_mother_age+Decomp_22*mean_adjusted_father_age as PCA_Dim2 from ( select mother_age ,mother_age-avg_mother_age as mean_adjusted_mother_age ,father_age ,father_age-avg_father_age as mean_adjusted_father_age ,Decomp_11 ,Decomp_12 ,Decomp_21 ,Decomp_22 from `projectid.datasetid.zzz_example_table` a, tbl_avg b, matrix m ) a );
Finally I will run the correlation (CORR() in BigQery) and then run the correlation for the new data to verify that the resulting transformed dataset has zero correlation.
select round(corr(mother_age,father_age),6) as Original_data_Correlation ,round(corr(PCA_Dim1,PCA_Dim2),6) as PCA_Transformation_Correlation from `projectid.datasetid.zzz_example_final_transform`
As you can see above, the original correlation was .78, but now after this transformation the correlation between the two transformed dimensions is very close to zero. This is probably due to the rounding and approximation of the solution found by using the where clause as a solver. In addition I have visualized the results from the zzz_example_final_transform table in Tableau Public with a loose description of the two PCA dimensions.