Hello again — it’s been a while since my last post, but I have not been idle. The reality is that most of my efforts have been directed toward my company’s needs and our move over to Google BigQuery, so I haven’t developed many new capabilities that I can share publicly. KNIME and Google BigQuery have been great, enabling me to move very fast with data and produce a few proofs of concept. To think that maybe eight years ago I was doing everything in Microsoft Access and Excel. Gone are those days; Microsoft Excel is now just my Notepad.
One thing I do not have yet is the ability in KNIME to upload data back into BigQuery. Perhaps that capability will come soon, but BigQuery is all flush-and-fill, and KNIME seems to need a batch ability. Long story short: I wanted to do Principal Component Analysis on a dataset and transform two variables into their principal components, but have it scheduled in Google BigQuery so it will automatically refresh when new data arrives. So I reduced the problem into one totally solvable in BigQuery SQL. I will not go into depth about why you need to do PCA, what the covariance matrix tells us about the data, or how to interpret the eigenvalues or loadings. That horse has been beaten by people far more capable than me. This is purely a discrete approximation implemented in SQL. I am indebted to Lindsay I Smith who wrote this paper here:
Step 0: Creating the test dataset and finding correlations.
Using the natality public data in Google BigQuery, I have created a dataset with two columns I assume are highly correlated: mother_age and father_age.
CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_table` AS (
  SELECT
    source_year
    ,year
    ,month
    ,day
    ,wday
    ,state
    ,mother_age
    ,father_age
    ,gestation_weeks
    ,apgar_5min
    ,weight_gain_pounds
    ,weight_pounds
    ,is_male
  FROM `bigquery-public-data.samples.natality`
  WHERE year >= 2005
    AND father_age <> 99
  --ORDER BY year DESC
  LIMIT 20000
);
Then I used Google BigQuery’s CORR function to check the correlation between the columns:
SELECT
  CORR(mother_age, father_age) AS ParentAge_Correl
FROM `projectid.datasetid.zzz_example_table`
WHERE father_age <> 99
… and as expected, those two columns are correlated: ParentAge_Correl: 0.78433544. Such a high correlation leads me to believe I will find a strong component from them.
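For anyone curious what CORR is doing under the hood, here is a pure-Python sketch of the Pearson correlation coefficient it computes. The age lists are made-up illustration data, not the natality sample.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# made-up parent ages, chosen to be strongly correlated
mother_age = [22, 25, 28, 31, 34, 37]
father_age = [24, 26, 30, 33, 35, 40]
print(round(pearson_corr(mother_age, father_age), 5))
```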
Step 1: Creating the Covariance Matrix.
Google BigQuery has done a great job on its statistical functions, and similar to the correlation function used above, I will first make a covariance matrix using the tools already available. You may want to persist this as a table since it will be used more than once; however, the dataset is so small and BigQuery so fast that I didn’t see the need to.
SELECT
  ROUND(COVAR_POP(mother_age, mother_age), 5) AS ParentAge_Covar11
  ,ROUND(COVAR_POP(father_age, mother_age), 5) AS ParentAge_Covar12
  ,ROUND(COVAR_POP(mother_age, father_age), 5) AS ParentAge_Covar21
  ,ROUND(COVAR_POP(father_age, father_age), 5) AS ParentAge_Covar22
FROM `projectid.datasetid.zzz_example_table`
WHERE father_age <> 99
I am going to treat this as one row, using the suffix “_RowIdColumnId” on any fields that are part of a matrix, with indices starting at one. One could build an actual matrix out of this using a union. It produces this output:
In addition, I have rounded everything in this matrix to the fifth decimal place. If your data needs more precision or the values are very small, you will need to adjust accordingly, but this is accurate enough for my needs.
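The four COVAR_POP calls above amount to the following, sketched in pure Python (same made-up ages as before). Note that the two off-diagonal entries are identical — the covariance matrix is symmetric.

```python
def covar_pop(xs, ys):
    """Population covariance, matching BigQuery's COVAR_POP (divides by n, not n-1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# made-up illustration data
mother_age = [22, 25, 28, 31, 34, 37]
father_age = [24, 26, 30, 33, 35, 40]

cov11 = covar_pop(mother_age, mother_age)  # variance of mother_age
cov12 = covar_pop(father_age, mother_age)
cov21 = covar_pop(mother_age, father_age)  # same as cov12 by symmetry
cov22 = covar_pop(father_age, father_age)  # variance of father_age
```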
Step 2: Creating a dataset of possible values to solve for the eigenvalues.
My personal familiarity with the subject matter suggests that human fecundity in this technological age ranges from the teenage years for mothers to perhaps the sixties for fathers, and that the two ages are positively correlated; as a result, I expect my eigenvalues to fall between 0 and 100. These are very big assumptions. I encourage you to run data you do not know very well through KNIME’s various modules first. The code below generates one million rows from 0.0000 to 99.9999, and it is super easy in BigQuery. What’s a million rows these days?
WITH nbrs AS (
  SELECT 0 AS nbr UNION ALL
  SELECT 1 UNION ALL
  SELECT 2 UNION ALL
  SELECT 3 UNION ALL
  SELECT 4 UNION ALL
  SELECT 5 UNION ALL
  SELECT 6 UNION ALL
  SELECT 7 UNION ALL
  SELECT 8 UNION ALL
  SELECT 9
)
,valuez AS (
  SELECT ROUND(tens.nbr*10.0 + ones.nbr*1.0 + tenths.nbr/10.0 + hundies.nbr/100.0
       + thousies.nbr/1000.0 + tenthousies.nbr/10000.0, 5) AS value
  FROM nbrs tens
  JOIN nbrs ones ON 1=1
  JOIN nbrs tenths ON 1=1
  JOIN nbrs hundies ON 1=1
  JOIN nbrs thousies ON 1=1
  JOIN nbrs tenthousies ON 1=1
  ORDER BY 1 DESC
)
SELECT * FROM valuez
If your eigenvalues fall outside the range of this specific example, adjust accordingly.
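The cross-join trick above can be sketched in Python: six digit positions (tens down to ten-thousandths), each 0–9, cross-joined, give every value from 0.0000 to 99.9999 in steps of 0.0001 — exactly one million candidates.

```python
digits = range(10)

# each loop variable is one digit position; the cross product enumerates the grid
valuez = [
    round(tens * 10 + ones + tenths / 10 + hundies / 100
          + thousies / 1000 + tenthousies / 10000, 5)
    for tens in digits for ones in digits for tenths in digits
    for hundies in digits for thousies in digits for tenthousies in digits
]
print(len(valuez))  # one million candidate values
```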
Step 3: Using the SQL where clause as a math solver.
Somebody on the internet is teaching people to use functions to create smart indices on SQL databases: he saved the left-hand side of a function as an index and used the WHERE clause as a solver. That’s pretty bad-ass if you ask me, and I will attempt to use the same method. I am going to combine all the previous steps as common table expressions (CTEs) and then solve for the eigenvalues. I select the long list of candidate values I’m certain the eigenvalues fall within, Cartesian-join it to the covariance matrix, allow a rather loose tolerance of between -0.0500 and 0.0500 in the WHERE clause, then group the surviving candidates, order them by accuracy, and keep the most accurate answers. Note: depending on the accuracy you need and the rounding you select, you may get back zero, one, or two eigenvalues. Play around with the rounding and adjust as needed.
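Before the full SQL, here is the core of that solver idea in Python. For a 2x2 covariance matrix, the characteristic polynomial is lambda^2 - trace*lambda + det; we scan the candidate grid and keep the values that drive it closest to zero. The covariance entries here are made-up illustration numbers, not the natality results.

```python
# made-up 2x2 covariance matrix entries
cov11, cov12, cov21, cov22 = 26.25, 28.0, 28.0, 31.5
trace = cov11 + cov22                  # "Lambda1" in the SQL below
det = cov11 * cov22 - cov21 * cov12    # "integervalue" in the SQL below

def residual(lam):
    """Characteristic polynomial of the 2x2 matrix; zero at an eigenvalue."""
    return lam * lam - trace * lam + det

# scan the same 0.0000..99.9999 grid, keeping near-zero residuals
step = 0.0001
candidates = (round(i * step, 5) for i in range(1_000_000))
hits = [lam for lam in candidates if -0.05 <= residual(lam) <= 0.05]
```

The `hits` list clusters around the two eigenvalues; the SQL version then ranks by |residual| to pick the best candidate in each cluster.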
CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_eigenvalues` AS (
  WITH nbrs AS (
    SELECT 0 AS nbr UNION ALL
    SELECT 1 UNION ALL
    SELECT 2 UNION ALL
    SELECT 3 UNION ALL
    SELECT 4 UNION ALL
    SELECT 5 UNION ALL
    SELECT 6 UNION ALL
    SELECT 7 UNION ALL
    SELECT 8 UNION ALL
    SELECT 9
  )
  ,valuez AS (
    SELECT ROUND(tens.nbr*10.0 + ones.nbr*1.0 + tenths.nbr/10.0 + hundies.nbr/100.0
         + thousies.nbr/1000.0 + tenthousies.nbr/10000.0, 5) AS value
    FROM nbrs tens
    JOIN nbrs ones ON 1=1
    JOIN nbrs tenths ON 1=1
    JOIN nbrs hundies ON 1=1
    JOIN nbrs thousies ON 1=1
    JOIN nbrs tenthousies ON 1=1
  )
  ,CovarMatrix AS (
    SELECT
      ROUND(COVAR_POP(mother_age, mother_age), 5) AS ParentAge_Covar11
      ,ROUND(COVAR_POP(father_age, mother_age), 5) AS ParentAge_Covar12
      ,ROUND(COVAR_POP(mother_age, father_age), 5) AS ParentAge_Covar21
      ,ROUND(COVAR_POP(father_age, father_age), 5) AS ParentAge_Covar22
    FROM `projectid.datasetid.zzz_example_table`
    WHERE father_age <> 99
  )
  SELECT
    EigenValue
    ,AccuracyValueRank
    ,DENSE_RANK() OVER (ORDER BY EigenValue DESC) AS EigenValueID
  FROM (
    SELECT
      *
      ,DENSE_RANK() OVER (ORDER BY ABS(Accuracy) ASC) AS AccuracyValueRank
    FROM (
      SELECT
        *
        ,ROUND(POW(EigenValue, 2) - Lambda1*EigenValue + integervalue, 5) AS Accuracy
      FROM (
        SELECT
          valuez.value AS EigenValue
          -- integervalue is the determinant and Lambda1 the trace of the covariance matrix
          ,ParentAge_Covar11*ParentAge_Covar22 - ParentAge_Covar21*ParentAge_Covar12 AS integervalue
          ,ParentAge_Covar11 + ParentAge_Covar22 AS Lambda1
        FROM valuez, CovarMatrix
        -- UNION ALL -- not sure if negatives are needed; distant math memories tell me eigenvalues can be negative
        -- SELECT
        --   -1*valuez.value
        --   ,ParentAge_Covar11*ParentAge_Covar22 - ParentAge_Covar21*ParentAge_Covar12 AS integervalue
        --   ,ParentAge_Covar11 + ParentAge_Covar22 AS Lambda1
        -- FROM valuez, CovarMatrix
      ) a
      WHERE ROUND(POW(EigenValue, 2) - Lambda1*EigenValue + integervalue, 5) BETWEEN -0.0500 AND 0.0500
      GROUP BY 1, 2, 3, 4
      --ORDER BY 4 DESC
    ) a
    ORDER BY AccuracyValueRank ASC
  ) aa
  WHERE AccuracyValueRank = 1
);
I am grouping the candidate values, finding the numbers that bring the characteristic polynomial closest to zero, and then selecting those two.
If you run the above code in Google BigQuery, you will get these results:
How close were those eigenvalues compared to the real ones?
| Row | KNIME Eigenvalue | SQL Eigenvalue | % Diff |
The KNIME eigenvalue came from the PCA compute module, and a quick check confirms they are very close to each other.
Using the WHERE clause as a solver is not the most ideal solution, but it works well for now. A future improvement is to use an actual factorization formula to find the eigenvalues exactly, without the million rows, but this worked well enough for the example. I’m not going to sugar-coat it: if the eigenvalues are very close together, the method above may fail to separate them. I used two columns I knew were highly correlated, with father’s age likely greater than mother’s age. If you need negative candidates, you’ll have to uncomment the negative UNION ALL part over valuez.
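For what it’s worth, that “actual factorization formula” is just the quadratic formula for a 2x2 matrix: lambda^2 - trace*lambda + det = 0 solves in closed form, no million-row grid required. A sketch, again with made-up covariance entries:

```python
import math

def eigenvalues_2x2(c11, c12, c21, c22):
    """Closed-form eigenvalues of a 2x2 matrix via the quadratic formula."""
    trace = c11 + c22
    det = c11 * c22 - c21 * c12
    # the discriminant is always non-negative for a symmetric matrix
    disc = math.sqrt(trace * trace - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2

lam1, lam2 = eigenvalues_2x2(26.25, 28.0, 28.0, 31.5)
```

The two results satisfy lam1 + lam2 = trace and lam1 * lam2 = det, which is a quick sanity check on any eigenvalue pair.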
Next week I will calculate the rest of the matrix you see in the KNIME image above, take the original data, and transform it into the new principal-component dimensions. Then I will use the CORR() function in BigQuery to check that the new PCA components are not correlated.