Science Fiction Now: Reactionless Drives

We all want to reach the stars. But how realistic is it for near-future humanity to actually reach them? I am defining near-future as within 50 years. That's just the time to develop the technology, not the travel time. There is no realistic way to reach the nearest stars without a multi-decade travel time.

The nearest star, Proxima Centauri, is about 4.2 light-years away. One light-year is approximately 9.46 trillion km, so 4.2 times that is almost 40 trillion kilometers to Proxima Centauri. Using Voyager as a benchmark: it left the solar system doing ~61,000 km/h, or approximately 17 km/s. Suppose we use some unspecified "new technology improvements" and increase the solar-system exit speed 100x (two orders of magnitude). How long would it take such a probe to reach the nearest star moving at ~1,700 km/s?

Well, this Super-Voyager would cover an incredible distance per year: about 53.5 billion km (17 km/s * 31.5 million seconds per year * a 100x technology factor). With no more velocity added in (nor any subtracted out once we reach the star), humanity would need roughly 740 years of travel before a souped-up Voyager-esque spacecraft got to another star. To give a little more context to those years: the cathedral Sagrada Família started construction in 1882 and is expected to be completed within a decade, so society would need to commit to something at least five times longer than a century-spanning cathedral. A town in Germany has completed three decades of a roughly 1,200-year public art installation called the time pyramid. Both are examples of societies committing resources to projects they will never see realized.
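The back-of-the-envelope arithmetic above fits in a few lines. A sketch, using the same rough figures as the text (not precise astronomy):

```python
# rough figures from the text above, not precise astronomy
voyager_kms = 16.985            # Voyager's solar-system exit speed, km/s
tech_factor = 100               # the assumed two-orders-of-magnitude improvement
seconds_per_year = 31_500_000   # ~1 year, in seconds
proxima_km = 4.2 * 9.46e12      # ~4.2 light-years, in km

km_per_year = voyager_kms * tech_factor * seconds_per_year
years = proxima_km / km_per_year
print(round(km_per_year / 1e9, 1))  # ~53.5 billion km covered per year
print(round(years))                 # roughly 740-750 years of travel
```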

So clearly we need a system that keeps accelerating during the flight in order to get a spaceship to another star within a ~100-year time frame. And if we are able to continuously accelerate during the trip, we might also have the ability to decelerate when we get to the star.

I will not be covering warp drives; even though the math works, I am trying to keep this down to realistic levels.

Using a Reaction Drive

A reaction drive is just another name for the normal rocket engine you would associate with space travel: pushing a bit of mass faster in one direction results in a proportional amount of momentum pushing the ship in the opposite direction. More information is linked here, but the Wikipedia image is easy to comprehend for visual thinkers:

Long story short (but very meta): in order to accelerate the ship, you also need to accelerate all the fuel you will burn later, and the fuel that pushes that fuel, and so on. So to accelerate the ship from 0 to 1 on the chart below you need a mass ratio of ~4:1; 0 to 2, ~8:1; 0 to 3, ~20:1.

A linear increase in delta-v needs a geometric (exponential) increase in mass.

Society has been able to accomplish human spaceflight despite this geometric problem, but in order to realistically travel to another star (in less than a few hundred years) we need something else.
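That geometric growth is just the Tsiolkovsky rocket equation, m0/m1 = e^(Δv/ve). A minimal sketch, with delta-v measured in units of the exhaust velocity (the chart's units aren't specified, so the exact ratios here are illustrative):

```python
import math

def mass_ratio(delta_v, exhaust_velocity):
    """Tsiolkovsky rocket equation: full-to-empty mass ratio m0/m1."""
    return math.exp(delta_v / exhaust_velocity)

ve = 1.0  # exhaust velocity, arbitrary units
for dv in (1, 2, 3, 4):
    print(dv, round(mass_ratio(dv, ve), 1))
# every additional unit of delta-v multiplies the required mass ratio
# by another factor of e (~2.7): linear delta-v, exponential mass
```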

So why a Reactionless Drive?

First of all, I have come across the term "Propellantless Propulsion" as a synonym for Reactionless Drive. I have to admit I love the term; catchy names and alliteration go a long way toward selling an idea. For the sake of this blog I will keep using the original term.

A reactionless drive is a space propulsion device that does not require any fuel carried on board in order to be propelled. Proposals include solar sails, laser-powered sails, and other more exotic drives. So how close is humanity to creating a true reactionless drive capable of traveling interstellar distances? Not very close, but some engineering work has been accomplished (even in space).

Proposals using known Physics (or studied extensively)

Bussard ramjet:
This is one of the more popular proposals and has been studied for decades. It isn't truly a reactionless drive, but it aims for the same result: instead of bringing the fuel with us, can we collect it with a magnetic scoop and burn it along the way? I won't go into much detail, but will instead point you to this article on Centauri Dreams. The short answer is 'no': the vehicle would not be able to accelerate into the solar wind.

Photon Rocket:
This rocket is almost like a flashlight in space. Okay, it is a little more complicated than that, but it is the most grounded proposal: it relies on no speculative science and uses established technology. One problem, though, is that the maximum speed achievable right now (no fusion drives, just nuclear fission) is about 0.02% the speed of light, or 60 km/s. That is like a Jupiter assist at the edge of our solar system: not enough for interstellar travel.
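A quick sanity check on those numbers (my own arithmetic, not from the post's sources): 0.02% of c is indeed about 60 km/s, and a photon rocket's thrust is power divided by c, which is why it is so power-hungry:

```python
c_km_s = 299_792.458            # speed of light, km/s
v = 0.0002 * c_km_s             # 0.02% of c
print(round(v, 1))              # ~60 km/s, matching the figure above

c_m_s = 299_792_458             # speed of light, m/s
power_w = 1e9                   # assume a 1 GW fission-powered photon drive
thrust_n = power_w / c_m_s      # photon thrust F = P / c
print(round(thrust_n, 2))       # only a few newtons, despite gigawatts of power
```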

Laser Power and Solar Sails:
This concept is one of the most realistic proposals for interstellar flight, though I was originally mistaken when I first thought about the idea. At first impression, a true solar sail would not reach interstellar speeds in any reasonable time frame (~100 years), and even if it did, it seemed like it would need another star of equal size at the destination if it wanted to stop. A laser could accelerate the craft just enough that the destination star's light pressure could eventually decelerate it, but the long deceleration time did not make the flight realistic. After researching a little more, I found interesting proposals for reflecting the laser light back onto the spacecraft using a detached mirror: the mirror continues to accelerate away, while the reflected beam slows the spacecraft over the second half of the trip.
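For a perfectly reflective sail, the light's momentum reverses instead of just being absorbed, so the thrust doubles to 2P/c. A hedged sketch (the 100 GW laser and 1,000 kg probe are made-up illustrative numbers, not from any specific proposal):

```python
C = 299_792_458  # speed of light, m/s

def sail_thrust(power_w, reflective=True):
    """Radiation-pressure thrust: P/c if absorbed, 2P/c if perfectly reflected."""
    return (2.0 if reflective else 1.0) * power_w / C

laser_w = 100e9                   # hypothetical 100 GW beaming station
force_n = sail_thrust(laser_w)    # ~667 N on a perfect mirror sail
accel = force_n / 1000.0          # acceleration of a 1,000 kg probe, m/s^2
print(round(force_n), round(accel, 2))
```

Small as that acceleration looks, it is continuous, which is the whole point of beamed propulsion.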

So that makes these proposals the most realistic for interstellar travel. Another check in their box is the fact that they have been studied and tested in actual space: everything from the Kepler Space Telescope to the IKAROS spacecraft to future proposals has used or plans to use solar radiation pressure to maneuver spacecraft.

The science works, the engineering has been tested, and it propels spacecraft! Take us to the STARS!

Electric Sail / Dipole Drive:
This proposal is currently in the development stage; from my research I have only read of one small satellite using it. "Space" is not empty: there are numerous charged particles flying around. These proposals use electric fields to deflect those particles in one direction, and thus propel the spacecraft in the opposite direction.

Most of the information I am using comes from this excellent video by Dr. Robert Zubrin:

A second, updated podcast (June 2020) is available here (~30 minutes in). Basically, with a double-sail electric device we can push the protons in space in one direction while the electrons are pushed in the opposite direction. Even though the mass difference is large (an electron is roughly 1/1836 the mass of a proton) and interstellar space has a roughly equal number of both, the net effect is that the spacecraft can push against the interstellar medium. The double grids can be maintained at their proper spacing with spin alone, though I suspect some force will be needed to maintain the revolutions over the vast interstellar distances. There might be a method to deflect particles in such a way as to impart spin (though I am not sure how that would work without adding more mass to stiffen the grids).

Now the biggest roadblock is not feasibility or physics, it's power. Dr. Zubrin mentions this in one of the videos, but he considers beamed power a valid option. Still, even with a small mass and unlimited power from beaming, I would need to increase the speed and decrease the spaceship's mass by at least 10x–100x each to achieve a 'more realistic' ~100-year performance.
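To put that '~100 year performance' target in numbers (my own back-of-the-envelope, using the same rough constants as earlier): crossing 4.2 light-years in a century requires averaging about 4% of light speed, several times faster than even the hypothetical 100x Voyager.

```python
c_km_s = 299_792.458
ly_km = 9.46e12                       # ~1 light-year, in km
seconds = 100 * 31_500_000            # ~100 years, in seconds

avg_speed = 4.2 * ly_km / seconds     # required average speed, km/s
print(round(avg_speed))               # ~12,600 km/s
print(round(avg_speed / c_km_s * 100, 1))  # ~4.2% of light speed

super_voyager = 16.985 * 100          # the earlier 100x-Voyager benchmark, km/s
print(round(avg_speed / super_voyager, 1)) # ~7x faster still
```

Note this ignores the acceleration and deceleration ramps, so the peak speed would need to be higher than the average.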

Another Proposal using Unknown Physics:

EM Drive:
I have added this to let people know that yes, sometimes engineering (aka reality) can inform pure science. Is it possible for engineering to create something before the science is confirmed? Absolutely; history has many inventions that predate the science explaining how they work. A few years ago this drive came up in the 'news'. The initial research produced a very marginal effect, which was never replicated. In addition, one claim was that any propulsive force measured was really just heat escaping from the drive, which makes it basically an inefficient photon rocket. The apparent thrust is far smaller than the testing margin of error, so even if it is real, it's not going to propel us to the stars.

I have included this speculative drive to remind the reader that there are still unknown forces in nature; hints of unknown unknowns show up in thrust anomalies from some of our distant probes and in unexplained energy gains during flybys. But since I am writing in the hard sci-fi genre, drives like this are too speculative for me to consider.


So from the proposals outlined above, I have hopefully shown why I chose the Electric Sail for traveling to another star. Even though I am taking great liberties with its performance in my upcoming book, I still have the trip taking over 100 years! And that is with a 50x improvement over currently existing technology.

Principal Component Analysis (PCA) in BigQuery

Google BigQuery has released Principal Component Analysis (PCA) into the wild. If you'd like to see what BigQuery ML can do, please check out this link.

This post follows up on the manual process I did in the distant past (even before stored procedures 😅): Part 1 and Part 2. I have decided to update the process to use the new features in BigQuery ML.

First, we are going to use public natality data again. This is a different dataset, but you should be able to find it in BigQuery: bigquery-public-data.sdoh_cdc_wonder_natality.county_natality. I have pulled out an interesting correlation below: the relationship between birth weight and average number of weeks pregnant. More weeks = bigger babies.

Correlation between birth weight in grams and average number of gestational weeks. Each dot is a county.

So now we are going to throw this into the PCA algorithm in BigQuery. ( PCA Syntax )

CREATE OR REPLACE MODEL `project.dataset.zzz_pca_example`
OPTIONS(model_type = 'pca', num_principal_components = 2  -- options reconstructed; adjust to your needs
) AS
SELECT
  Ave_Birth_Weight_gms
  , Ave_OE_Gestational_Age_Wks
FROM `bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`

The code above will train the model. I have not extensively tested this on large datasets, but it seemed to run quickly.

The PCA model in BigQuery comes with two table functions: ML.PRINCIPAL_COMPONENT_INFO to get the eigenvalues (explained variance), and ML.PRINCIPAL_COMPONENTS to get the eigenvectors (the loading matrix).

SELECT * FROM ML.PRINCIPAL_COMPONENT_INFO(MODEL `project.dataset.zzz_pca_example`)

SELECT * FROM ML.PRINCIPAL_COMPONENTS(MODEL `project.dataset.zzz_pca_example`)


Now I will show you the simple method to apply the PCA back to your original data.

SELECT *
FROM ML.PREDICT(MODEL `project.dataset.zzz_pca_example`,
  (SELECT
    Ave_Birth_Weight_gms
    , Ave_OE_Gestational_Age_Wks
    , County_of_Residence
  FROM `bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`)
  , STRUCT(true AS keep_original_columns))

Adding in that final STRUCT produces the original data and the transformed data side-by-side.

So if you plot these out you can see the transformation. I have added the trendlines. You can see that the new trendline follows the average on the y axis.

The code above is sufficient to give you everything you need to do PCA in BigQuery SQL.


I will also take you through the DIFFICULT way of making the same data using only the components. Using the output from the two table functions above, we can reconstruct the values without calling ML.PREDICT. Note that in this example my unit length was one, so I omitted scaling by that vector; you may need to do that scaling in other cases.

with decomp as (
    select
    max(case when (principal_component_id = 0 AND feature = 'Ave_Birth_Weight_gms') then numerical_value else null end) as Ave_Birth_Weight_gms_0
    ,max(case when (principal_component_id = 0 AND feature = 'Ave_Pre_pregnancy_BMI') then numerical_value else null end) as Ave_Pre_pregnancy_BMI_0
    ,max(case when (principal_component_id = 1 AND feature = 'Ave_Birth_Weight_gms') then numerical_value else null end) as Ave_Birth_Weight_gms_1
    ,max(case when (principal_component_id = 1 AND feature = 'Ave_Pre_pregnancy_BMI') then numerical_value else null end) as Ave_Pre_pregnancy_BMI_1
    from (
      select *
      FROM ML.PRINCIPAL_COMPONENTS(MODEL `project.dataset.zzz_pca_example`)
    )
)
, evector as (
    select *
    FROM ML.PRINCIPAL_COMPONENT_INFO(MODEL `project.dataset.zzz_pca_example`)
)
-- , vectorlength as (
--    Select
-- i.*
-- ,sqrt(POW(Ave_Birth_Weight_gms_0,2) + POW(Ave_Birth_Weight_gms_1,2)) as vector1_length
-- ,sqrt(POW(Ave_Pre_pregnancy_BMI_0,2) + POW(Ave_Pre_pregnancy_BMI_1,2)) as vector2_length
-- from decomp i
-- )
, meanadjusted as (
    Select * except(WindowAvgBirthWeight,WindowAvgPregBMI)
    ,round(Ave_Birth_Weight_gms-WindowAvgBirthWeight,4)  as MeanAdjustedBirthWeight
    ,round(Ave_Pre_pregnancy_BMI-WindowAvgPregBMI,4)   as MeanAdjustedPregBMI
    from (
      select
      round(Ave_Birth_Weight_gms,4) as Ave_Birth_Weight_gms
      ,round(Ave_Pre_pregnancy_BMI,4) as Ave_Pre_pregnancy_BMI
      ,round(avg(Ave_Birth_Weight_gms) over () ,4) as WindowAvgBirthWeight
      ,round(avg(Ave_Pre_pregnancy_BMI) over () ,4) as WindowAvgPregBMI
      FROM `bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`
    )
)
select ma.*
, Ave_Birth_Weight_gms_0*MeanAdjustedBirthWeight + Ave_Pre_pregnancy_BMI_0*MeanAdjustedPregBMI as PCA_Dim1
, Ave_Birth_Weight_gms_1*MeanAdjustedBirthWeight + Ave_Pre_pregnancy_BMI_1*MeanAdjustedPregBMI as PCA_Dim2
from meanadjusted ma
join decomp d on 1=1

Results of the difficult way:

My Data Sci-fi Novel

So I managed to publish a novel. No Lack of Sunshine is available at Amazon. It is a novel about the first-born person from a data seedship that travels to Lalande 21185. It has robots raising children, cellular agriculture, data science, exoplanets, <null values>, and more.

How does a person learn about empathy when raised by rigid robots?

Find out here:

Print, Kindle, or Audiobook!

Bigquery Connector in KNIME (2021 Update)

I still use KNIME. I also use the heck out of BigQuery, but KNIME lets me do a few transformations very quickly, and I can stuff two Excel sheets into one workbook with low fuss. So, having gotten a new computer (I threw tea all over my last one and fried it), I needed to set up my KNIME connection to BigQuery.

My previous post, Google BigQuery Connector in KNIME, though possibly still accurate, set up the connection using an in-browser authentication. That method still works in my old KNIME flows, but the in-browser auth is a bit cumbersome. Instead I took on a new challenge and used the Google BigQuery authorization that now comes standard with KNIME.

Step 1: File > Install KNIME Extensions

After I installed the above extension I restarted KNIME.

Step 2: Download JDBC from Google.

I scrolled down and downloaded the “JDBC 4.2-compatible” version and placed that in my KNIME folder.

Step 3: Change KNIME Preferences

Press File > Preferences > Database > Add.

Give the driver a name and an ID; I chose Bigquery and 1. Not very original, I know. Next, under Database Type choose "bigquery", which should now be in the drop-down menu thanks to step 2. Then press "Find Driver Classes" and com.simba.googlebigquery.jdbc.Driver should populate. The URL template syntax help will guide you, but I typed in jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443/;ProjectId=xxxxx where xxxxx is your BigQuery Project ID.

Step 4: Google Console.

Go to Google Console and specifically https://console.cloud.google.com/iam-admin/serviceaccounts?project=xxxx where xxxx is your project ID. Follow the prompts and give your new service account the correct rights to read from GCP. The account should look something like this: xxxxx@projectxxxx.iam.gserviceaccount.com

Step 5: Set up a P12 Key.

You should now see a new user in your Service Accounts menu. Now we need to add a key. Click on Service Accounts > User > Keys > Add Key (or create) and then choose P12 Key. Create and save that P12 into your KNIME folder. It should look something like projectxxxx-######-########.p12 with # being hexadecimal.

Step 6: Point KNIME to that P12.

Like the title says, just point KNIME to that P12 file along with the service account email you created in step 4.

Step 7: Grab these Nodes into the workflow:

Now go back to KNIME. Enter in “google auth” into the search box of the node repository and drag on “Google Authentication.” From here you should be able to add the BigQuery connector that you installed in Step 1 and the normal DB Query Reader.

‘Flat Earth’ data is a joke. It was earth data, but I wanted it more ‘flat’

The Google Bigquery Connector should now require minimal tweaks of the dropdowns.

Tada, now you have data from GCP BigQuery into KNIME.

Science Fiction Now: Foods from Machines

One of the recent books I have read (and thoroughly enjoyed) was Aurora by Kim Stanley Robinson. I won't talk about the book much, but one of the funnier criticisms left on Amazon was, loosely: on a spaceship with hundreds of years of Earth's information and 3D printers, why couldn't they just print food? I have linked it here, but beware: there are numerous spoilers in the review, so click at your own risk. Since ecology and food production were strong themes in the book, the criticism is valid.

Will we humans be able to print nutritious food in the future? I was interested in the question and started doing research for my forthcoming book. I came across a number of articles and companies producing human-digestible food without the traditional step of relying on multi-cellular animals or plants. Although high-tech farming methods such as hydroponics will still be utilized on spaceships in the future, it will likely be for carbon absorption; food production will not be their primary goal. This research altered my novel's path, because I think non-farmed food is very close to becoming Science Fiction Now.

Currently Mainstream Products:

I'm sure we've all seen the various protein supplements and vitamins available at any retail location; powdered protein and vitamins are long-standing products. One product that has existed for less than ten years now is Soylent. Very few of us have actually tried it, but I have had the fortunate opportunity to know someone who has, and he said it was perfectly fine as a meal supplement; he continued to consume it off and on over the course of a year. The product attempts to be a full replacement for meals and is advertised as "complete nutrition". Not to denigrate the company, but they did have recalls due to unfortunate human responses to another company's algal flour. Meanwhile they continue to use algal fats (per Wikipedia), which do not seem to have the same problems as the flour. So by using mostly algae, they have been able to create a mostly complete food product.

Beyond Meat and Impossible Burger:
Both of these companies offer a variety of meat analogs that greatly lower the amount of raw materials required to produce acceptable meat replacements. Impossible Burger claims its product results in 96% less land, 87% less water, and 89% fewer GHG emissions (which we can take as a proxy for energy). That may make the company look impressive, but the reality is that meat production is just a very inefficient food source. Any future technology that creates food from machines automatically inherits the nearly two orders of magnitude in efficiency that existing meat-analog technology demonstrates; converting normal agriculture to meat analogs can already yield over an order of magnitude difference.
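Those 'X% less' claims are easier to compare as multiplicative factors. A quick conversion (the percentages are the company claims quoted above; the arithmetic is my own):

```python
def reduction_factor(percent_less):
    """Convert an 'X% less' claim into a how-many-times-less factor."""
    return 1.0 / (1.0 - percent_less / 100.0)

for resource, pct in [("land", 96), ("water", 87), ("GHG", 89)]:
    print(resource, round(reduction_factor(pct), 1))
# 96% less land is a 25x reduction -- stack further process gains on top
# and two orders of magnitude over raising cattle is within reach
```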

There are numerous companies already producing algae-based products, and this seems like a mature industry. The benefit is that algae can be modified, pumped, and processed in what seems to be a continuous manner. Better examples are here, showing actual algal products being grown in a closed bioreactor. This industry seems to be where aquaculture was just twenty years ago.

Nutritional Yeast:
Again, even though this is also very mainstream, I wanted to add it since I didn't know enough about this miracle until doing research for my book. Glucose + Saccharomyces cerevisiae = carbs, fiber, protein, and all nine essential amino acids that humans need! It sounds futuristic, yet it is rather mundanely already in production. Sugar plus a single-celled organism can yield almost complete nutrition? This doesn't seem like fiction at all.

Near Future Products:

Cellular Agriculture:
This is a term I hadn't come across before writing this blog, but it encompasses the types of research I have outlined below. Most of these are not fully mainstream but are in the start-up phase.

Cultured Meat
aka artificial meat, aka lab-grown meat (among other akas). Apparently the idea of artificial meat goes back more than a century and is mentioned in a science fiction book called Two Planets, which seems not to have had an English version until 1971. The concept is simple: produce meat without the long, inefficient processes of pregnancy, growth, mobility, maturation, and eventual slaughter. The numbers here look similar to meat analogs, with 98% less land usage and 45% less energy usage. I don't see water usage, but I imagine it would be lower than the others, since we're cutting out the plant respiration and transpiration needed to grow feed, let alone the water needed to hydrate the cattle. To throw in extra "science fiction now" flair, such meat has been cultured on the International Space Station. Definitely a near-term reality. Several companies are already doing this, so I won't mention them individually.

New Culture Food:
Microbes to cheese? Yes please. This company claims its process can make casein protein from microbes. Their website also purports that the process greatly reduces all types of material usage when making microbe-cheese versus ordinary methods. I did a brief check of their site, and the two numbers listed both show more than an order of magnitude reduction in primary material. First, water usage: "… one litre of milk requires approximately one thousand litres of water." That does not even account for actual cheese making, where the milk is separated into curds and whey, so let's chalk that up to at least three orders of magnitude less water usage. The other number listed is "dry matter", which I assume means feed: to produce 1 kg of protein, they claim normal dairy production takes 75–300 kg of dry matter. That's easily another order and a half of magnitude in savings. Contrary to any aforementioned sci-fi books, there's not going to be any cattle in space (even dwarf cows). No Space Cows!

Finless Fish
+1 to this company for the name; I love it. They are making real fish protein from fish cells. They were short on details about the amount of savings, but essentially they are duplicating the same physiological processes that build fish, outside of the fish.

Current Research:

Okay, so we can get protein from microbes, meat from stem cells, and nearly balanced food from algae, but I want to do even better than what reality offers. That is why I went looking for some very cool technology in development: researchers trying to beat nature at its own game with more efficient photosynthesis.

Bioreactor for Vinegar:
The first article I found purports to take CO2, sunlight, bacteria, and nanowires (and time) to produce acetate (which is close to vinegar). The research paper is here.

biohybrid reactor
“For the past eight years, the researchers have been working on a hybrid system combining bacteria and nanowires that can capture the energy of sunlight to convert carbon dioxide and water into building blocks for organic molecules.”

I assume you can take those acetate molecules and convert them into longer-chain molecules. We're still a long way off from automated taco factories, but this appears to be a realistic technology.

Food from the Fins:
In this article from Lappeenranta University of Technology (LUT) and the VTT Technical Research Centre of Finland, the Finns claim to be able to produce protein, carbohydrates, fats, and nucleic acids using (apparently) only electricity and carbon dioxide. And magic? They didn't go into details, but I assume other feedstocks go into it. They also claim to be 10x more efficient than photosynthesis. That sounds like a pretty bold claim, but it would be another order of magnitude on top of what nature can already do.


This is all pretty interesting stuff to chew on. I don't doubt that in the future there will be an abundance of food products, even if none of the above technologies take off. The green revolution of the last eighty years has already produced great abundance; for example, Iowa grows more corn now than the entire United States did in 1942. And do you know what we do with all that extra corn? We feed it into cars and cows! There is no chance of the 'overpopulation' and starving-masses tropes from sci-fi, at least from a food production standpoint. Food distribution, sure, but we have so much food waste in the United States that I imagine future colonies of people in outer space will rarely if ever starve. We could already be several orders of magnitude more efficient with food production using current or near-future technology. That trope is dead. Long live artificial food!

A more interesting theme might be the people who clean or restock the food machines that will likely exist in the future, and how they manage to pass something from machine to machine, poisoning the populace or even corroding the machines.

Principal Component Analysis in SQL with Google BigQuery, Part 2: Finding the eigenvectors, matrix multiply, and checking.

UPDATE: BigQuery can do this all more efficiently via a Stored Procedure. More to come late 2021.

In the last post, Principal Component Analysis in SQL with Google BigQuery, Part 1, I used SQL to find the eigenvalues. Now I will use the temp table I made there and demonstrate how to apply the transformation back to the original data. If you didn't read that post, or have your own eigenvalues and different data, you should still be able to apply this next part to your own data. My data is stored in a temp table called projectid.datasetid.zzz_example_table and my eigenvalues are saved in a different table called projectid.datasetid.zzz_example_eigenvalues, both of which are just generic placeholders. I am not going into any detail on why/how, as the math behind it can be derived from this excellent paper by Lindsay I Smith: http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf

Step 1:  Finding the resulting eigenvectors and the length of each vector.

In the query below, I take the two eigenvalues found in the previous post and the covariance matrix, subtract each eigenvalue from the diagonal, set x2 equal to one, and solve for x1 for each eigenvalue. Then I calculate the length of each resulting vector.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_evectors` as (
with eval1 as (
select
max(EigenValue) as EigenValue1
from `projectid.datasetid.zzz_example_eigenvalues`
where EigenValueID = 1
)
,eval2 as (
select
min(EigenValue) as EigenValue2
from `projectid.datasetid.zzz_example_eigenvalues`
where EigenValueID = 2
)
,CovarMatrix as (
Select --this should be a saved table since we're using it twice, but whatever
round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22 
from `projectid.datasetid.zzz_example_table`
where father_age <> 99
)
select a.*
,SQRT(POW(evector_11,2)+POW(evector_12,2)) as Vector1_Length
,SQRT(POW(evector_21,2)+POW(evector_22,2)) as Vector2_Length
from (
select
-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue1) as evector_11
,1.00 as evector_12 --we are forcing this value to one and solving for the above using the first eigenvalue
,-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue2) as evector_21
,1.00 as evector_22 --we are forcing this value to one and solving for the above using the second eigenvalue
from CovarMatrix,eval1,eval2
) a
)

Next I will divide the eigenvectors by the vector lengths to re-scale them and get the final matrix, which we will then use to transform the original data via matrix multiplication. Again, the paper above is invaluable for this step. The lengths are Vector1_Length and Vector2_Length in the SQL above. Combining these in the query below produces a decomposition matrix similar to what I would get out of KNIME.

select * from (
select
1 as RowID
,evector_11/Vector1_Length as Col_1
,evector_12/Vector1_Length as Col_2
from `projectid.datasetid.zzz_example_evectors`

union all

select
2 as RowID
,evector_21/Vector2_Length as Col_1
,evector_22/Vector2_Length as Col_2
from `projectid.datasetid.zzz_example_evectors`
) order by 1 asc
RowID Col_1 Col_2
1 0.6400828 0.76830594
2 -0.7683052 0.64008368

Output from the KNIME PCA Compute module:


Now I will take the original table data, find the means for mother_age and father_age in the tbl_avg CTE, and then subtract those means from each row's values to get the mean-adjusted ages.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_final_transform` as (
with tbl_avg as (
select
round(AVG(mother_age),6) as avg_mother_age
,round(AVG(father_age),6) as avg_father_age
from `projectid.datasetid.zzz_example_table`
)
, matrix as (
select
evector_11/Vector1_Length as Decomp_11
,evector_12/Vector1_Length as Decomp_12
,evector_21/Vector2_Length as Decomp_21
,evector_22/Vector2_Length as Decomp_22
from `projectid.datasetid.zzz_example_evectors`
)
select a.*
,Decomp_11*mean_adjusted_mother_age+Decomp_12*mean_adjusted_father_age as PCA_Dim1
,Decomp_21*mean_adjusted_mother_age+Decomp_22*mean_adjusted_father_age as PCA_Dim2
from (
select
mother_age ,mother_age-avg_mother_age as mean_adjusted_mother_age
,father_age ,father_age-avg_father_age as mean_adjusted_father_age
,Decomp_11 ,Decomp_12
,Decomp_21 ,Decomp_22
from `projectid.datasetid.zzz_example_table` a, tbl_avg b, matrix m
) a
)

Finally, I will run the correlation (CORR() in BigQuery) on the original columns and on the transformed data to verify that the resulting transformed dataset has near-zero correlation.

select
round(corr(mother_age,father_age),6) as Original_data_Correlation
,round(corr(PCA_Dim1,PCA_Dim2),6) as PCA_Transformation_Correlation
from `projectid.datasetid.zzz_example_final_transform`
Original_data_Correlation PCA_Transformation_Correlation
0.784335 2.00E-06

As you can see above, the original correlation was 0.78, but after the transformation the correlation between the two transformed dimensions is very close to zero. The tiny residual is probably due to the rounding and the approximate solution found by using the where clause as a solver. In addition, I have visualized the results from the zzz_example_final_transform table in Tableau Public, with a loose description of the two PCA dimensions.
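As a cross-check outside BigQuery, the same two-variable recipe (covariance matrix, quadratic-formula eigenvalues, x2 = 1 eigenvectors, normalize, project) fits in a few lines of plain Python. The toy ages below are made up, not the natality data, so only the near-zero final correlation should match:

```python
import math

# toy correlated data standing in for mother_age / father_age
xs = [20, 22, 25, 28, 30, 33, 35, 38, 40, 42]
ys = [22, 25, 26, 30, 31, 36, 36, 40, 43, 44]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# population covariance matrix entries (the COVAR_POP equivalents)
a = sum((x - mx) ** 2 for x in xs) / n                     # Covar11
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # Covar12 == Covar21
d = sum((y - my) ** 2 for y in ys) / n                     # Covar22

# eigenvalues of the symmetric 2x2 matrix via the quadratic formula
disc = math.sqrt((a + d) ** 2 - 4 * (a * d - b * b))
lam1, lam2 = (a + d + disc) / 2, (a + d - disc) / 2

def unit_eigvec(lam):
    # set x2 = 1, solve (a - lam)*x1 + b = 0 for x1, then normalize
    x1 = -b / (a - lam)
    length = math.hypot(x1, 1.0)
    return x1 / length, 1.0 / length

v1, v2 = unit_eigvec(lam1), unit_eigvec(lam2)

# project the mean-adjusted rows onto each eigenvector
dim1 = [v1[0] * (x - mx) + v1[1] * (y - my) for x, y in zip(xs, ys)]
dim2 = [v2[0] * (x - mx) + v2[1] * (y - my) for x, y in zip(xs, ys)]

def corr(u, w):
    mu, mw = sum(u) / len(u), sum(w) / len(w)
    cov = sum((p - mu) * (q - mw) for p, q in zip(u, w))
    su = math.sqrt(sum((p - mu) ** 2 for p in u))
    sw = math.sqrt(sum((q - mw) ** 2 for q in w))
    return cov / (su * sw)

print(round(corr(xs, ys), 4))      # strongly correlated input
print(round(corr(dim1, dim2), 6))  # ~0 after the transform
```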



Principal Component Analysis in SQL with Google BigQuery, Part 1: Finding the covariance matrix and eigenvalues.

UPDATE: BigQuery can do this all more efficiently via a Stored Procedure. More to come late 2021.

Hello again, it's been a while since my last post, but I have not been idle. The reality is that most of my efforts have been directed toward my company's needs and our move over to Google BigQuery, so I haven't developed many more capabilities that I can share publicly. KNIME and Google BigQuery have been great at enabling me to move very fast with data and produce a few proofs-of-concept. To think that maybe eight years ago I was doing everything in Microsoft Access and Excel. Gone are those days; Microsoft Excel is now just my Notepad.

One thing I do not yet have is the ability in KNIME to upload data back into BigQuery. Perhaps that capability will come soon, but BigQuery is all flush-and-fill, and KNIME seems to need a batch ability. So, long story short: I wanted to do Principal Component Analysis on a dataset, transforming two variables into their principal components, but have it scheduled in Google BigQuery so it automatically refreshes when new data arrives. So I reduced the problem to one totally solvable in BQ SQL. I will not go into depth about why you need to do PCA, what the covariance matrix tells us about the data, or how to interpret the eigenvalues or loadings; that horse has been beaten with a stick by people far more capable than me. This is purely a discrete approximation implemented in SQL. I am indebted to Lindsay I Smith, who wrote this paper: http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf

Step 0:  Creating the test dataset and finding correlations.

Using the natality public data in Google BigQuery, I have created a dataset with at least two columns I assume are very correlated: mother_age and father_age.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_table` as (
Select *
FROM `bigquery-public-data.samples.natality`
where year >= 2005
and father_age <> 99
--order by year desc
LIMIT 20000
)

Then I used Google BigQuery’s CORR function to check the correlation between the two columns.

Select CORR(mother_age,father_age) as ParentAge_Correl
from `projectid.datasetid.zzz_example_table`
where father_age <> 99

… and as expected, those two columns are correlated: ParentAge_Correl = 0.78433544.  Highly correlated variables lead me to believe I will find a strong component from them.

Step 1:  Creating the Covariance Matrix.

Google BigQuery has done a great job on their statistical functions, and similar to the correlation function utilized above, I will first make a covariance matrix using the tools already made available.  You may want to make this a table as it will be used more than once; however, the dataset is so small and BigQuery so fast that I didn’t see the need to.

Select round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22
from `projectid.datasetid.zzz_example_table`
where father_age <> 99

I am going to treat this as one row, with the suffix “*_RowIdColumnId” on any fields that are part of a matrix, starting the index at one.  One could make an actual matrix out of this using a union.  It produces this output:

ParentAge_Covar11 ParentAge_Covar12 ParentAge_Covar21 ParentAge_Covar22
39.26018 35.54529 35.54529 52.31282

In addition, I have rounded everything on this matrix to the 5th decimal place.  If your data needs more precision or is very small, you will need to adjust accordingly, but this is accurate enough for my needs.
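Before building the brute-force grid, it is worth noting that for a 2x2 covariance matrix the eigenvalues have a closed form: they are the roots of the characteristic polynomial lambda^2 - trace*lambda + det = 0. As a quick cross-check outside of SQL (a minimal sketch of my own, using the covariance values from the output above, not part of the scheduled workflow):

```python
import math

# 2x2 covariance matrix values from the query output above
c11, c12, c21, c22 = 39.26018, 35.54529, 35.54529, 52.31282

# Eigenvalues of a 2x2 matrix are the roots of the characteristic
# polynomial: lambda^2 - trace*lambda + det = 0
trace = c11 + c22
det = c11 * c22 - c12 * c21

disc = math.sqrt(trace * trace - 4.0 * det)
eig_hi = (trace + disc) / 2.0   # larger eigenvalue
eig_lo = (trace - disc) / 2.0   # smaller eigenvalue
```

These closed-form roots (roughly 81.926 and 9.647) are what the grid search in the next steps approximates.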

Step 2: Creating a dataset of possible values to solve for the eigenvalues.

My personal familiarity with the subject matter leads me to believe that human fecundity in this current technological age ranges from the teens for the mother up to maybe the sixties for the father, and that the two are positively correlated; as a result I expect my eigenvalues to be between 0 and 100 years of age.  These are very big assumptions.  I encourage you to run data you do not know very well through KNIME’s various modules.  The code below will generate 1 million rows from 0.0000 to 99.9999, but this is super easy in BigQuery.  What’s a million rows these days?

with nbrs as (
Select 1 as nbr union all Select 0 union all Select 2 union all Select 3 union all
Select 4 union all Select 5 union all Select 6 union all Select 7 union all Select 8 union all Select 9
)
,valuez as (
Select round(tens.nbr*10.0+ones.nbr*1.0+tenths.nbr/10.00+ hundies.nbr/100.00 + thousies.nbr/1000.00 + tenthousies.nbr / 10000.00,5) as value
from nbrs tens
join nbrs ones on 1=1
join nbrs tenths on 1=1
join nbrs hundies on 1=1
join nbrs thousies on 1=1
join nbrs tenthousies on 1=1
order by 1 desc
)
Select * from valuez

If your eigenvalues fall outside the range of this specific example, adjust accordingly.
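For readers who want to sanity-check the grid outside of BigQuery, the same six-digit cross-join can be mirrored in a few lines of Python (my own illustration, not part of the original workflow):

```python
from itertools import product

# Mirror the SQL digit cross-join: six copies of the digits 0-9 joined
# together yield every value from 0.0000 to 99.9999 in 0.0001 steps.
digits = range(10)
values = [
    round(t * 10 + o + te / 10 + h / 100 + th / 1000 + tt / 10000, 5)
    for t, o, te, h, th, tt in product(digits, repeat=6)
]
```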

Step 3:  Using the SQL where clause as a math solver.

Somebody on the internet is training people to use functions to create smart indices on SQL databases.  This was pretty novel in my opinion: he saved the left-hand side of a function as an index and used the where clause as a solver.  That’s pretty bad-ass if you ask me, and I will attempt to utilize that method.  Now I am going to combine all the previous steps as common table expressions (CTEs) and then solve for the eigenvalue.  I select a long list of values I’m certain the eigenvalues exist in, Cartesian-join the covariance matrix, allow a rather loose threshold of being between -0.0500 and 0.0500 in the where clause, and finally group those possible answers, order them by accuracy, and take the most accurate answers.  Note: depending on the accuracy you need and the rounding you select, you may get back zero, one, or two eigenvalues.  Play around with the rounding and adjust as needed.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_eigenvalues` as (
with nbrs as (
Select 1 as nbr union all Select 0 union all Select 2 union all Select 3 union all
Select 4 union all Select 5 union all Select 6 union all Select 7 union all Select 8 union all Select 9
)
,valuez as (
Select round(tens.nbr*10.0+ones.nbr*1.0+tenths.nbr/10.00+ hundies.nbr/100.00 + thousies.nbr/1000.00 + tenthousies.nbr / 10000.00,5) as value
from nbrs tens
join nbrs ones on 1=1
join nbrs tenths on 1=1
join nbrs hundies on 1=1
join nbrs thousies on 1=1
join nbrs tenthousies on 1=1
)
,CovarMatrix as (
Select round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22
from `projectid.datasetid.zzz_example_table`
where father_age <> 99
)
Select EigenValue
,AccuracyValueRank
,DENSE_RANK() over (order by EigenValue desc) as EigenValueID
from (
Select EigenValue
,Accuracy
,dense_rank() over (order by abs(Accuracy) asc) as AccuracyValueRank
from (
Select EigenValue
,integervalue
,Lambda1
,round(POW(EigenValue,2)-Lambda1*EigenValue+integervalue,5) as Accuracy
from (
Select valuez.value as EigenValue
,ParentAge_Covar11*ParentAge_Covar22-ParentAge_Covar21*ParentAge_Covar12 as integervalue
,ParentAge_Covar11+ParentAge_Covar22 as Lambda1
from valuez,CovarMatrix
-- union all --not sure if negatives are needed. Distant math memories tell me yes, eigenvalues can be negative
-- Select
-- -1*valuez.value as EigenValue
-- ,ParentAge_Covar11*ParentAge_Covar22-ParentAge_Covar21*ParentAge_Covar12 as integervalue
-- ,ParentAge_Covar11+ParentAge_Covar22 as Lambda1
-- from valuez,CovarMatrix
) a
where round(POW(EigenValue,2)-Lambda1*EigenValue+integervalue,5) between -0.0500 and 0.0500
group by 1,2,3,4
--order by 4 desc
) a
order by AccuracyValueRank asc
) aa
where AccuracyValueRank = 1
)

I am grouping the values, finding the numbers that get me the closest to zero and then selecting those two.
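The same filter-then-rank idea can be sketched outside SQL. The Python below is an illustration of the technique (it assumes the covariance values printed in Step 1): it evaluates the characteristic polynomial over the candidate grid, keeps candidates inside the loose ±0.05 threshold, and takes the most accurate candidate in each contiguous run, one run per eigenvalue.

```python
# Covariance matrix values from Step 1, and the characteristic
# polynomial p(x) = x^2 - trace*x + det whose roots are the eigenvalues.
c11, c12, c21, c22 = 39.26018, 35.54529, 35.54529, 52.31282
trace, det = c11 + c22, c11 * c22 - c12 * c21

# Scan the 0.0000..99.9999 grid, keeping candidates inside the loose
# +/-0.05 threshold (the SQL where clause).
candidates = []
for i in range(1_000_000):
    x = i / 10_000
    acc = x * x - trace * x + det
    if -0.05 <= acc <= 0.05:
        candidates.append((x, abs(acc)))

# Candidates cluster around each root; split the clusters on gaps and
# keep the most accurate candidate per cluster (AccuracyValueRank = 1).
eigenvalues, cluster = [], []
prev = None
for x, err in candidates:
    if prev is not None and x - prev > 0.0002:
        eigenvalues.append(min(cluster, key=lambda c: c[1])[0])
        cluster = []
    cluster.append((x, err))
    prev = x
if cluster:
    eigenvalues.append(min(cluster, key=lambda c: c[1])[0])
```

This recovers values close to the 9.647 and 81.926 the SQL finds.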


If you run the above code in Google BigQuery, you will get these results:

Row EigenValue AccuracyValueRank EigenValueID
1 81.926 1 1
2 9.647 1 2

How close were those eigenvalues compared to the real ones?

Row KNIME Eigenvalue SQL Eigenvalue % Diff
1 81.930053 81.926 0.0049%
2 9.6475241 9.647 0.0054%

The KNIME eigenvalue came from the PCA compute module, and a quick check confirms they are very close to each other.


Using the where clause as a solver is not the most ideal solution, but it worked well enough for this example.  A future application is to use an actual factorization formula to find the eigenvalues accurately and without the million rows.  I’m not going to sugar-coat this: it is possible that if the eigenvalues are very close together, the above method will not work.  I used two columns I knew were very correlated (and where the father’s age is likely greater than the mother’s).  If negative eigenvalues are possible in your data, I believe you’ll have to uncomment the negative union-all valuez part to find those.

Next week I will calculate the rest of the matrix you see in the KNIME image above, take the original data, and transform it into the new principal component dimensions.  Then I will use the CORR() function in BigQuery to check and make sure the new PCA components are not correlated.

Google BigQuery Connector in KNIME

KNIME.com posted a method for connecting a KNIME workflow to Google BigQuery:  https://www.knime.com/blog/GoogleBigQuery_meets_SQLite

This article blended SQLite and BigQuery data sources.  In my life, KNIME has been a combination of a large roll of duct tape and baling wire, coupled with a Leatherman & Swiss Army Knife.  It is a tool to connect anything to anything else, keep all that embedded logic, and enable analytics to occur across any dataset with any tool.  The BigQuery connector did not quite work out-of-the-box for me as they had it set up in their article.  I took their workflow, simplified a few steps, and was able to get BigQuery to connect in a KNIME workflow!

Here are the detailed instructions I used to connect.

  1.  Download and install the SIMBA JDBC driver from this website.  I used version 4.2 on my machine:  https://cloud.google.com/bigquery/partners/simba-drivers/
  2.  Unzip this into your KNIME folder.
  3.  Point KNIME to the directory shown above.  Click on File > Preferences, then KNIME > Databases, and add the directory.
  4.  Copy the first two modules in their KNIME example workflow called 10_GoogleBigQuery… in the examples server.  Search for “bigquery” in the KNIME Explorer and it will be there.  Open that workflow and then save it to your workspace.

    Remove everything but the first two modules at the top left.  Then add the default JDBC “Database Reader” node next to the Database Connector and connect them.  Also read the KNIME.com link I posted above; they state that using the “Credentials Input” module encrypts the password data.
  5.  Reconfigure the middle module “Connect to BigQuery DB…” with these settings:

    The Database Driver dropdown should now show com.simba.googlebigquery.jdbc42.Driver.  Replace the database URL with this information, replacing “xxxxxx” with your billing project.

  6.  Click on that same module “Database Connector” and press play.  It will give you a popup with a URL.  Copy and paste that URL into a new web browser (I had to use a browser with only one Google sign-in).  Then you will get a screen like this with a special code:

    Copy and paste that code xxxxxxx-xxxxxx back into the KNIME dialog box.
    Updated Note:  This step *must* be done quickly, as the module may time out before you finish entering your credentials.
  7.  You should now get a green finished module.  In the subsequent Database Reader module, enter this standard SQL query (not legacy) and run it.  It is a super-simple pull of only 10 rows from a public BigQuery dataset about taxi cab pickups.
    SELECT pickup_longitude, pickup_latitude 
    FROM `nyc-tlc.green.trips_2014` LIMIT 10
  8. Tada!  Data from BQ into KNIME.  Enjoy and make the magic happen!  Data to the people!

Adjustable color Saturation in Tableau

Have you ever wanted to allow your user more choices in coloring a heatmap?  Currently Tableau does not allow the viewer of a visualization to adjust its colors; the user must have the web-edit ability or Tableau Desktop to make major changes to the color scales.  Although with a few tricks you could adjust a dashboard based on a parameter and show different views, I wanted to allow my users more control without those methods.  Specifically, I wanted a heat-map where the user can control how much saturation the extreme values have and how much those values visually ‘overpower’ the lesser values.  Conversely, a user could ‘empower’ the lower values and equalize them with the outliers.

The first thing I did was make a list parameter, populate it with some numeric values, and then give it more user-friendly “Display as” values.  The reason I did this is to allow the user several options without offloading the entire decision-making process to them.  This will also keep the view consistent over many viewers or iterations, and relieve the user of too many choices (the colors could go wildly wrong).  Next I added the parameter to the worksheet and made it a slider-type parameter.  Tableau’s parameters have several options, but to me the ‘slider’ makes sense in situations where there are a few concrete options.
[Image: Color Desat Parameter v2]

Next I fed this parameter into a new formula that I called “Colorizer”:

[Colorizer] : log([Mag])^[Color Desaturator]

I’ve calculated out the possibilities below in Excel.  (Yes, although you *can* do such a chart in Tableau, Excel is still good at what it does best.)


As you can see in the table on the right side, the difference between the Min and Max values (diff) varies wildly depending on the parameter selected.  For this specific dataset based on earthquake magnitudes the above formula works well; however, for your specific data you may need to add in a different base using this formula:

[Colorizer] : log([Mag],[base > 10])^[Color Desaturator]
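To see what the exponent actually does to the values, here is a quick sketch of my own (using hypothetical magnitudes and assuming base-10 logs, as Tableau’s log() defaults to): raising log(mag) to a higher power widens the ratio between the smallest and largest colorized values, which is exactly the ‘overpowering’ effect described above.

```python
import math

# Hypothetical earthquake magnitudes and a few Color Desaturator settings.
mags = [6.0, 7.0, 8.0, 9.6]
spreads = []
for k in (1, 2, 4, 8):
    colorized = [math.log10(m) ** k for m in mags]
    # ratio of the strongest color value to the weakest
    spreads.append(max(colorized) / min(colorized))

# A larger exponent produces a larger spread, so the outliers dominate
# the color scale; a smaller exponent equalizes the values.
```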

As you can see below, the colors washout tremendously depending on the parameter selected.  This allows the user to focus on all numbers at the lowest setting, or only the outliers at the max setting.



The unfortunate side-effect of this is that the Tableau default color legend is now *incomprehensible*.  The dataset I used is on earthquakes, ranging from magnitude 6.0 to a maximum of 9.6.  However, look at the default legend, which I have now termed the “Useless Legend”:

The result is that you will need to create another legend with the same color value (using the [Colorizer] field) but with the actual value (in this case the [Mag] field) in the text description.  Here is an example:

Enjoy this simple trick by downloading and deconstructing the embedded view below.



Using R’s elasticsearchr library to connect KNIME to Elasticsearch.

I’d like to highlight another great suite of KNIME connectors.  These have been in KNIME for a long time, but due to the incredible number of libraries within R, I can make KNIME do so much more.  I had a situation where I wanted to connect to Elasticsearch, extract JSON data, contort it into a data frame, and then export that into KNIME and eventually into Tableau.  Thankfully R has a lightweight Elasticsearch library called “elasticsearchr”.  I will utilize this in the R module to extract the data.


Since we are depending on R to generate the data, the first module we are going to use is called ‘R Source (Workspace)’.


I am passing three Elasticsearch variables into this module (red line): server, index, and document.  This is done so I can easily pass this information on to other R modules for re-use.  Here is the R script I am using in this module to query the data and extract a simple count by day for an action.

Without an aggregation:

This should suffice for simple data pulls, even among events that may have different JSON structures:

########### CONSTRUCT the Query ###############
qr1 <- query('{
  "bool": {
    "must": [
      { "terms": { } },
      { "range": {
          "data.xxxxxxxx": {
            "gte": "now-1d/d",
            "lt": "now+1d/d"
          } } }
    ]
} }')
####### GRAB THE DATA ###########
esdata <- elastic(knime.flow.in[["ES_Server"]],knime.flow.in[["MicroSvc"]],"clientaction") %search% (qr1)

The variable above ‘esdata’ is a list, and I handle extraction in a separate module called R to Table:


In which I convert the data into a KNIME table by just addressing the specific data elements I want.

knime.out <- data.frame(
  # ...address the specific data elements you want here, starting with
  # the "data.xxxx.xxxxx.xxxxxx" field referenced below...
)

If one of those elements doesn’t exist in the specific JSON document, then KNIME will kindly null (the red ?) that field.  The first term is the “data.xxxx.xxxxx.xxxxxx” event referenced above.


This is the super-simple way to get the data.

With An Aggregation:

When the JSON structure might be more complicated and you only wish to have an aggregation, such as a count by day, then this method below enables that.

########### CONSTRUCT the Query ###############
qr1 <- query('{
  "term": {
    "data.xxxxxxxx": {
      "value": "xxxxx"
} } }')
########### CONSTRUCT the AGG ###############
ag1 <- aggs('{
  "actions": {
    "terms": {
      "field": "data.xxxxxx", "size": 10000
    },
    "aggs": {
      "data.xxxxxxx": {
        "date_histogram": {
          "field": "data.xxxxxxx",
          "interval": "day"
} } } } }')
## Debugging
#print(qr1 + ag1)

esdata2 <- elastic(knime.flow.in[["ES_Server"]]
,knime.flow.in[["DocumentName"]]) %search% (qr1 + ag1)

The aggregation is not strictly needed; a straight Elasticsearch query can be executed.  However, if your JSON structure frequently changes, it may be difficult to extract the information.

Extraction into KNIME again uses that “R to Table” module but with very different code.  Note that I am open to suggestions about this section, as this R code might be very inefficient.

### get the length for the loop ###
maxi <- length(esdata2$key)

### make a list to contain the looped results ###
datalist = list()

### loop through, constructing a data frame at each point and pass that into a list ###
for (i in 1:maxi) {
 df1 <- as.data.frame(esdata2$data.xxxxxxx.buckets[[i]])
 dat <- data.frame(esdata2$key[i],esdata2$doc_count[i],df1$key_as_string,df1$doc_count)
 datalist[[i]] <- dat
}
### re-extract the list into a data frame ###
big_data = do.call(rbind,datalist)

## export back out to KNIME ##
knime.out <- as.data.frame(big_data)
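The loop pattern above (build one small data frame per outer bucket, collect them in a list, then rbind them together) is a standard way to flatten nested aggregation results into rows. The same flattening, sketched in Python on hypothetical bucket data (the field names are illustrative, not from a real Elasticsearch response):

```python
# Hypothetical Elasticsearch aggregation buckets: one outer terms bucket
# per action, each holding inner date-histogram buckets.
outer_buckets = [
    {"key": "click", "doc_count": 5,
     "days": [{"key_as_string": "2019-01-01", "doc_count": 2},
              {"key_as_string": "2019-01-02", "doc_count": 3}]},
    {"key": "view", "doc_count": 4,
     "days": [{"key_as_string": "2019-01-01", "doc_count": 4}]},
]

# Repeat each outer bucket's key/count next to every inner-bucket row,
# producing one flat row per (action, day) pair.
rows = []
for bucket in outer_buckets:
    for day in bucket["days"]:
        rows.append((bucket["key"], bucket["doc_count"],
                     day["key_as_string"], day["doc_count"]))
```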


I have utilized the above workflow and scheduled it to execute automatically every morning about the time I get to work.