Science Fiction Now: Reactionless Drives

We all want to reach the stars. But how realistic is it for near-future humanity to actually reach them? I am defining near-future as within 50 years. That is just the time to develop the technology, not the travel time. There is no realistic way to reach even the nearest stars without a multi-decade travel time.

The nearest star, Proxima Centauri, is about 4.2 light-years away. One light-year is approximately 9.46 trillion km, so 4.2 times that is almost 40 trillion kilometers to Proxima Centauri. Using Voyager as a benchmark, it left the solar system doing ~61,000 km/h, or approximately 16.985 km/s. Suppose we use some unspecified “new technology improvements” and increase the solar-system exit speed 100x (two orders of magnitude). How long will it take such a probe to reach the nearest star moving at ~1,700 km/s?

Well, this Super-Voyager would cover an incredible 53.6 billion km per year (16.985 km/s × 31.5 million seconds/year × 100 technology factor). With no more velocity added in (nor any subtracted out once we get to the star), humanity would need about 744 years of travel before a souped-up Voyager-esque spacecraft got to another star. To give a little more context to those years: the cathedral Sagrada Família started construction in 1882 and is expected to be completed within a decade, so society would need to commit to something at least five times longer than a century-spanning cathedral. A town in Germany has completed three decades of a roughly 1,200-year public art installation called the time pyramid. Both are examples of society committing resources to projects that the builders will never see realized.
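The arithmetic above is easy to sanity-check. A quick sketch in Python, using the same round figures as the text (the constants are approximations, not precise astronomical values):

```python
# Back-of-the-envelope check of the Super-Voyager travel-time estimate.
SECONDS_PER_YEAR = 31.5e6          # rounded, as in the text
voyager_speed_km_s = 16.985        # Voyager's solar-system exit speed
tech_factor = 100                  # assumed "new technology" speed-up
distance_km = 4.2 * 9.46e12        # 4.2 light-years in kilometres

km_per_year = voyager_speed_km_s * tech_factor * SECONDS_PER_YEAR
years = distance_km / km_per_year
print(f"Distance covered per year: {km_per_year:.3e} km")  # ~5.35e10 (53.5 billion)
print(f"Travel time: {years:.0f} years")                   # roughly 740+ years
```

Depending on how much you round the light-year and the year length, you land somewhere in the low 740s, which is the point: far too long for any society to wait.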

So clearly we need a system that keeps accelerating during the flight in order to get a spaceship to another star in a ~100-year time frame. And if we are able to continuously accelerate during the trip, we might also have the ability to decelerate when we get to the star.

I will not be covering warp drives, as I am trying to keep this at realistic levels, even though the math works.

Using a Reaction Drive

A reaction drive is just another name for the normal rocket engine you would associate with space travel. Pushing a bit of mass faster in one direction results in a proportional amount of mass being pushed in the opposite direction. More information is linked here, but this Wikipedia image is easy to comprehend for visual thinkers:

Long story short (but very meta): in order to accelerate the ship, you first need to accelerate the fuel you will burn later, and accelerating that fuel takes yet more fuel. So to accelerate the ship from 0 to 1 on the chart below you need a mass ratio of ~4:1; from 0 to 2, ~8:1; from 0 to 3, ~20:1.

A linear increase in delta-v requires an exponential increase in mass.
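That exponential relationship is the Tsiolkovsky rocket equation: the ratio of initial to final mass is e^(Δv/vₑ). A minimal sketch (the exhaust velocity here is my own illustrative choice, roughly a hydrogen/oxygen engine, not a figure from the chart):

```python
import math

def mass_ratio(delta_v_km_s, exhaust_velocity_km_s):
    """Tsiolkovsky rocket equation: initial/final mass for a given delta-v."""
    return math.exp(delta_v_km_s / exhaust_velocity_km_s)

v_e = 4.5  # km/s, assumed exhaust velocity
for dv in (4.5, 9.0, 13.5):
    print(f"delta-v {dv} km/s -> mass ratio {mass_ratio(dv, v_e):.1f}:1")
# Each additional linear step of delta-v MULTIPLIES the required mass ratio,
# which is why chemical rockets can never reach interstellar speeds.
```

Doubling the delta-v squares the mass ratio; that is the exponential wall the chart illustrates.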

Society has been able to accomplish human spaceflight despite this exponential problem, but in order to realistically travel to another star (in less than a few-hundred-year lifetime) we need something else.

So why a Reactionless Drive?

First of all, I have come across the term “Propellantless Propulsion” as a synonym for Reactionless Drive. I have to admit I love the term; catchy names and alliteration go a long way toward selling an idea. For the sake of this blog I will keep using the original term.

A reactionless drive is a space propulsion device that does not require any propellant carried on board. Proposals include solar sails, laser-powered sails, and other more exotic drives. So how close is humanity to creating a true reactionless drive capable of traveling interstellar distances? Not very close, but some engineering work has been accomplished (even in space).

Proposals using known Physics (or studied extensively)

Bussard ramjet:
This is one of the more popular proposals and has been studied for decades. It isn’t truly a reactionless drive, but it aims at the same result: instead of bringing the fuel with us, can we collect it with a magnetic scoop and burn it along the way? I won’t go into much detail, but will instead point you to this article on Centauri Dreams. The short answer is ‘no’: the vehicle would not be able to accelerate into the solar wind.

Photon Rocket:
This rocket is almost like a flashlight in space. Okay, it is a little more complicated than that, but it is the most grounded proposal, as it relies on no speculative science and uses established technology. One problem, though: the maximum speed achievable right now (no fusion drives, just nuclear fission) is about 0.02% of the speed of light, or 60 km/s. That is like a Jupiter assist at the edge of our solar system; not enough for interstellar speeds.
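The photon rocket's weakness falls out of a single formula: ideal thrust is power divided by the speed of light. A small sketch (the one-gigawatt reactor is my own illustrative assumption):

```python
C = 299_792_458  # speed of light, m/s

def photon_thrust_newtons(power_watts):
    """Ideal photon rocket: perfectly collimated light yields P/c newtons."""
    return power_watts / C

# Even a hypothetical 1-gigawatt reactor driving a perfect emitter:
thrust = photon_thrust_newtons(1e9)
print(f"{thrust:.2f} N of thrust from 1 GW")  # ~3.34 N
```

A few newtons per gigawatt is why a pure photon drive, powered by fission, tops out at solar-system-class speeds rather than interstellar ones.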

Laser Power and Solar Sails:
This concept is one of the most realistic proposals for interstellar flight. However, I was originally mistaken when I first thought about the idea. At first impression, a true solar sail would not reach interstellar speeds in any reasonable time frame (~100 years), and even if it did, it seemed like it would need another star of equal size at the destination if it wanted to stop. A laser could accelerate the craft just enough that the continual light pressure of the destination star could decelerate it, but the long deceleration time did not make the flight realistic. After researching a little more, I found interesting proposals for reflecting the laser light back onto the spacecraft using a detached mirror: the mirror keeps accelerating away, but the reflected beam can slow the spacecraft down over the second half of the trip.

So that makes these proposals the most realistic for interstellar travel. Another check in their box is the fact that they have been studied and tested in actual space. Everything from the Kepler Space Telescope to the IKAROS spacecraft has used, or in future proposals plans to use, solar radiation pressure to maneuver.

The science works, the engineering has been tested, and it propels spacecraft! Take us to the STARS!

Electric Sail / Dipole Drive:
This proposal is currently in the development stage; from my research I have only read of one small satellite using it. “Space” is not empty: there are numerous charged particles flying around. These proposals use designs that push the particles flying through space in one direction and thus propel the spacecraft in the opposite direction.

Most of the information I am using comes from this excellent video by Dr. Robert Zubrin:

A second, updated podcast (June 2020) is available here (~30 minutes in). Basically, with a double-sail electric device we can push the protons in space in one direction, and the electrons will be pushed in the opposite direction. Even though the mass difference is large (an electron is only about 1/1,800th the mass of a proton), interstellar space has a relatively equal number of both, so a spacecraft can push against the interstellar medium. The double grids can be maintained at their proper spacing with spin alone, though I suspect some force will be needed to maintain the rotation over the vast interstellar distances. There might be a method of deflecting particles in a way that imparts spin (though I am not sure how that would work without adding more mass to stiffen the grids).

Now the biggest roadblock is not the physics; it's power. Dr. Zubrin mentions this in one of the videos, but he considers beamed power to be a valid option. Still, even with a small mass and unlimited power from beaming, I would need to increase the speed and decrease the spaceship's mass by at least 10x-100x each to achieve a 'more realistic' ~100-year performance.
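To put a number on that ~100-year target: the required average cruise speed follows from distance over time. A quick sketch (this ignores the acceleration and deceleration phases, so it is a lower bound):

```python
LIGHT_YEAR_KM = 9.46e12
SECONDS_PER_YEAR = 31.5e6

def required_speed_km_s(distance_ly, trip_years):
    """Average cruise speed needed to cover a distance in a given time."""
    return distance_ly * LIGHT_YEAR_KM / (trip_years * SECONDS_PER_YEAR)

v = required_speed_km_s(4.2, 100)  # Proxima Centauri in 100 years
print(f"{v:.0f} km/s, about {v / 299_792:.1%} of light speed")
```

Roughly 12,600 km/s, a bit over 4% of light speed, which shows why the 10x-100x improvements over today's demonstrated sails are the crux of the problem.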

Another Proposal using Unknown Physics:

EM Drive:
I have added this to acknowledge that yes, sometimes engineering (aka reality) can inform pure science. Is it possible for engineering to create something before the science is confirmed? Absolutely; history has many inventions that predate the science explaining how they worked. A few years ago this drive came up in the 'news'. The initial research produced a very marginal effect, but it was never replicated. In addition, one claim was that any propulsive force measured was really just heat escaping from the drive, which makes it basically an inefficient photon rocket at that point. The apparent thrust is smaller than the testing margin of error, so even if it is real, it's not going to propel us to the stars.

I have included this speculative drive to remind the reader that there are still unknown forces in nature; hints of unknown unknowns show up in thrust anomalies from some of our distant probes and in unexplained energy gains during flybys. But since I am writing in the hard sci-fi genre, drives like this are too speculative for me to consider.


So from the proposals outlined above, I have hopefully shown why I chose the electric sail for traveling to another star. Even though I am taking great liberties with its performance in my upcoming book, I still have the trip taking over 100 years! And that is with a 50x improvement over currently existing technology.

Science Fiction Now: Foods from Machines

One of the recent books I have read (and thoroughly enjoyed) was Aurora by Kim Stanley Robinson. I won't talk about the book much, but one of the funnier criticisms left on Amazon was, loosely: on a spaceship with hundreds of years of Earth's information and 3D printers, why couldn't they just print food? I have linked it here, but beware: there are numerous spoilers in the review, so click at your own risk. Since ecology and food production were strong themes in the book, the criticism is valid.

Will we as humans be able to print nutritious food in the future? I was interested in the question and started doing research for my forthcoming book. I came across a number of articles and companies producing human-digestible food without the traditional step of relying on multi-cellular animals or plants. Although high-tech farming methods such as hydroponics will still be utilized on spaceships in the future, it will likely be for carbon-dioxide absorption; food production will not be their primary goal. This research altered my novel's path, because I think non-farmed food is very close to becoming Science Fiction Now.

Currently Mainstream Products:

I’m sure we’ve all seen the various protein supplements and vitamins available at any retail location; powdered protein and vitamins are long-standing products. One newer product, which has existed for less than ten years now, is Soylent. Very few of us have actually tried it, but I have had the fortune to know someone who consumed it off and on over the course of a year, and he said it was perfectly fine as a meal supplement. The product attempts to be a full replacement for meals and is advertised as “complete nutrition”. Not to denigrate the company, but they did have recalls of their product due to unfortunate human reactions to another company’s algal flour. Meanwhile they continue to use algal fats (based on Wikipedia), which do not seem to have the same problems as the flour. So by using mostly algae, they have been able to create a mostly complete food product.

Beyond Meat and Impossible Burger:
Both these companies offer a variety of meat analogs that greatly lower the amount of raw materials required to produce acceptable meat replacements. Impossible Burger claims their product results in 96% less land, 87% less water, and 89% fewer GHG emissions (which we can assume is a proxy for energy). That may make the company look fairly impressive, but the reality is that meat production is just a very inefficient way to make food. Any future technology that creates food from machines will start from nearly two orders of magnitude of efficiency gains simply by building on existing meat-analog technology; converting normal agriculture to meat analogs can already yield over an order of magnitude difference.

There are numerous companies already producing algae-based products, and this seems like a mature industry. The benefit is that algae can be modified, pumped, and processed in what seems to be a continuous manner. Better examples are here, showing actual algal products being grown in a closed bioreactor. This industry seems to be where aquaculture was just twenty years ago.

Nutritional Yeast:
Again, even though this is also very mainstream, I wanted to add it since I didn’t know enough about this miracle until doing research for my book. Glucose + Saccharomyces cerevisiae = carbs, fiber, protein, and all nine essential amino acids that humans need! It sounds very futuristic but is rather mundanely already in production. Sugar and a single-celled organism can yield almost complete nutrition? That doesn’t seem like fiction at all.

Near Future Products:

Cellular Agriculture:
This is a term I hadn’t come across before writing this blog, but it incorporates the types of research I have outlined below. Most of these are not fully mainstream but are in the start-up phase.

Cultured Meat
Aka artificial meat, aka lab-grown meat (among other akas). Apparently the idea of artificial meat goes back more than a century and is mentioned in a science fiction book called Two Planets, which seems not to have had an English version until 1971. The concept is simple: produce meat without the long, inefficient processes of pregnancy, growth, mobility, maturation, and eventual slaughter. The numbers here look similar to meat analogs, with 98% less land usage and 45% less energy usage. I don’t see water usage listed, but I imagine it would be lower still, since we’re cutting out the water that plants transpire while growing the feed, let alone the water needed to hydrate the cattle. To throw in extra “science fiction now” flair, such meat has been cultured on the International Space Station. Definitely a near-term reality. There are several companies already doing this, so I won’t list them.

New Culture Food:
Microbes to cheese? Yes please. This company claims their process can make casein protein from microbes, and their website purports that the process greatly reduces all types of material usage when making microbe-cheese versus ordinary methods. I did a brief check of their website, and the two numbers listed both show more than an order of magnitude reduction in primary materials. First, water usage: “… one litre of milk requires approximately one thousand litres of water.” That does not even account for actual cheese making, where many litres of milk are separated into curds and whey, so let’s chalk that up to at least three orders of magnitude less water usage. The other number listed is “dry matter”, which I assume means feed: to produce 1 kg of protein, they claim normal dairy production takes 75-300 kg of dry matter. That’s easily another order and a half of magnitude in savings. Contrary to any aforementioned sci-fi books, there’s not going to be any cattle in space (not even dwarf cows). No Space Cows!

Finless Fish
Plus one to this company for the name; I love it. They are making real fish protein from fish cells. They were short on details about the amount of savings, but essentially they are duplicating the same physiological processes that make fish, just outside of the fish.

Current Research:

Okay, so we can get protein from microbes, meat from stem cells, and nearly balanced food from algae, but I want to do even better than what reality currently offers. That is why I was excited to find some very cool technology being developed: researchers trying to beat nature at its own game with more efficient photosynthesis.

Bioreactor for Vinegar:
The first article I found purports to take CO2, sunlight, bacteria, and nanowires (and time) to produce acetate (which is close to vinegar). The research paper is here.

biohybrid reactor
“For the past eight years, the researchers have been working on a hybrid system combining bacteria and nanowires that can capture the energy of sunlight to convert carbon dioxide and water into building blocks for organic molecules.”

I assume you can take those acetate molecules and convert them into longer-chain molecules. We're still a long way off from automated taco factories, but this appears to be a realistic technology.

Food from the Finns:
In this article from Lappeenranta University of Technology (LUT) and VTT Technical Research Centre of Finland, the Finns claim to be able to produce protein, carbohydrates, fats, and nucleic acids using (apparently) only electricity and carbon dioxide. And magic? They didn’t go into details, but I assume other input materials are involved. They also claim to be 10x more efficient than photosynthesis. It sounds like a pretty bold claim, but that would be another order of magnitude on top of what nature can already do.


This is all pretty interesting stuff to chew on. I don’t doubt that the future will hold an abundance of food products, even if none of the above technologies take off. The green revolution of the last eighty years has already produced great abundance; for example, Iowa grows more corn now than the entire United States did in 1942. And do you know what we do with all that extra corn? We feed it to cars and cows! There is little chance of the ‘overpopulation’ and starving-masses tropes from sci-fi, at least from a food production standpoint. Food distribution, sure, but we have so much food waste in the United States that I imagine future colonies in outer space will rarely, if ever, starve. We could already be several orders of magnitude more efficient with food production using current or near-future technology. That trope is dead. Long live artificial food!

A more interesting theme might be the people who clean or restock the food machines that will likely exist in the future, and how they might pass something from machine to machine, poisoning the populace or even corroding the machines.

Principal Component Analysis in SQL with Google BigQuery, Part 2: Finding the eigenvectors, matrix multiplication, and checking.

In the last post, Principal Component Analysis in SQL with Google BigQuery, Part 1, I used SQL to find the eigenvalues.  Now I will use the temp table I made there and demonstrate how to apply the transformation back to the original data.  If you didn't read that post, or have your own eigenvalues and different data, you should still be able to apply this next part without reading it.  My data is stored in a temp table called projectid.datasetid.zzz_example_table and my eigenvalues are saved in a different table called projectid.datasetid.zzz_example_eigenvalues, both of which are just generic placeholders.  I am not going into any detail on the why/how, as the math behind it can be derived from this excellent paper by Lindsay I. Smith:  http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf

Step 1:  Finding the resulting eigenvectors and the length of each vector.

In the query below, I take the two eigenvalues found in the previous post and the covariance matrix, subtract each eigenvalue from the diagonal, set x2 equal to one, and solve for x1 for each eigenvalue.  Then I calculate the length of each vector.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_evectors` as (
with eval1 as (
Select max(EigenValue) as EigenValue1
from `projectid.datasetid.zzz_example_eigenvalues`
where EigenValueID = 1
)
,eval2 as (
Select min(EigenValue) as EigenValue2
from `projectid.datasetid.zzz_example_eigenvalues`
where EigenValueID = 2
)
,CovarMatrix as (
Select --this should be a saved table since we're using it twice, but whatever
round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22
from `projectid.datasetid.zzz_example_table`
where father_age <> 99
)
Select *
,SQRT(POW(evector_11,2)+POW(evector_12,2)) as Vector1_Length
,SQRT(POW(evector_21,2)+POW(evector_22,2)) as Vector2_Length
from (
Select
-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue1) as evector_11
,1.00 as evector_12 --we are forcing this value to one and solving for the above using the first eigenvalue
,-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue2) as evector_21
,1.00 as evector_22 --we are forcing this value to one and solving for the above using the second eigenvalue
from CovarMatrix,eval1,eval2
) a
)

Next I will divide the eigenvectors by their vector lengths to re-scale them and get the final matrix, which we will then use to transform the original data via matrix multiplication.  Again, the paper above is invaluable for understanding this step.  These lengths are Vector1_Length and Vector2_Length in the SQL above.  Combining these in the query below will produce a decomposition matrix similar to what I would get out of KNIME.

Select * from (
Select
1 as RowID
,evector_11/Vector1_Length as Col_1
,evector_12/Vector1_Length as Col_2
from `projectid.datasetid.zzz_example_evectors`

union all

Select
2 as RowID
,evector_21/Vector2_Length as Col_1
,evector_22/Vector2_Length as Col_2
from `projectid.datasetid.zzz_example_evectors`
) order by 1 asc
RowID Col_1 Col_2
1 0.6400828 0.76830594
2 -0.7683052 0.64008368

Output from the KNIME PCA Compute module:


Now I will take the original table data, find the means for mother_age and father_age in the tbl_avg CTE, and then subtract each mean from the row's value to get the mean-adjusted ages.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_final_transform` as (
with tbl_avg as (
Select
round(AVG(mother_age),6) as avg_mother_age
,round(AVG(father_age),6) as avg_father_age
from `projectid.datasetid.zzz_example_table`
)
,matrix as (
Select
evector_11/Vector1_Length as Decomp_11
,evector_12/Vector1_Length as Decomp_12
,evector_21/Vector2_Length as Decomp_21
,evector_22/Vector2_Length as Decomp_22
from `projectid.datasetid.zzz_example_evectors`
)
Select *
,Decomp_11*mean_adjusted_mother_age+Decomp_12*mean_adjusted_father_age as PCA_Dim1
,Decomp_21*mean_adjusted_mother_age+Decomp_22*mean_adjusted_father_age as PCA_Dim2
from (
Select
mother_age ,mother_age-avg_mother_age as mean_adjusted_mother_age
,father_age ,father_age-avg_father_age as mean_adjusted_father_age
,Decomp_11 ,Decomp_12
,Decomp_21 ,Decomp_22
from `projectid.datasetid.zzz_example_table` a, tbl_avg b, matrix m
) a
)

Finally, I will run the correlation function (CORR() in BigQuery) on both the original columns and the new dimensions to verify that the resulting transformed dataset has zero correlation.

Select
round(corr(mother_age,father_age),6) as Original_data_Correlation
,round(corr(PCA_Dim1,PCA_Dim2),6) as PCA_Transformation_Correlation
from `projectid.datasetid.zzz_example_final_transform`
Original_data_Correlation PCA_Transformation_Correlation
0.784335 2.00E-06

As you can see above, the original correlation was 0.78, but after the transformation the correlation between the two transformed dimensions is very close to zero.  That it is not exactly zero is probably due to the rounding and the approximate solution found by using the where clause as a solver.  In addition, I have visualized the results from the zzz_example_final_transform table in Tableau Public with a loose description of the two PCA dimensions.
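The whole pipeline can also be cross-checked outside SQL. Here is a pure-Python sketch on a tiny made-up dataset (the age pairs below are my own synthetic stand-ins for mother_age/father_age, not the natality data); it mirrors each SQL step: population covariance, eigenvalues from the characteristic quadratic, the set-x2-to-one eigenvector trick, normalization, and the final decorrelation check:

```python
import math

# Synthetic, correlated stand-in for (mother_age, father_age) pairs.
xs = [18, 22, 25, 30, 34, 40, 21, 28, 33, 37]
ys = [20, 25, 27, 33, 36, 45, 24, 31, 35, 41]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n
c11 = sum((x - mx) ** 2 for x in xs) / n                    # COVAR_POP(x, x)
c12 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n  # COVAR_POP(x, y)
c22 = sum((y - my) ** 2 for y in ys) / n                    # COVAR_POP(y, y)

# Eigenvalues from the characteristic quadratic of the 2x2 covariance matrix.
tr, det = c11 + c22, c11 * c22 - c12 ** 2
l1 = (tr + math.sqrt(tr ** 2 - 4 * det)) / 2
l2 = (tr - math.sqrt(tr ** 2 - 4 * det)) / 2

def unit_evector(lam):
    # Same trick as the SQL: set the second component to 1, solve, normalize.
    v1 = -c12 / (c11 - lam)
    length = math.sqrt(v1 ** 2 + 1)
    return v1 / length, 1 / length

(e11, e12), (e21, e22) = unit_evector(l1), unit_evector(l2)

# Project the mean-adjusted data onto the eigenvectors; since both projected
# dimensions have zero mean, correlation reduces to a normalized dot product.
d1 = [e11 * (x - mx) + e12 * (y - my) for x, y in zip(xs, ys)]
d2 = [e21 * (x - mx) + e22 * (y - my) for x, y in zip(xs, ys)]
corr_new = sum(a * b for a, b in zip(d1, d2)) / (
    math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2)))
print(f"correlation after PCA: {corr_new:.2e}")  # essentially zero
```

Because the eigenvectors of a symmetric covariance matrix are orthogonal, the residual correlation here is pure floating-point noise, which is exactly the behavior the CORR() check above is verifying.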


PCA in Tableau.PNG


Principal Component Analysis in SQL with Google BigQuery, Part 1: Finding the covariance matrix and eigenvalues.

Hello again.  It's been a while since my last post, but I have not been idle.  The reality is that most of my efforts have been directed toward my company's needs and our move over to Google BigQuery, so I haven't developed many more capabilities that I can share publicly.  KNIME and Google BigQuery have been great at enabling me to move very fast with data and produce a few proofs-of-concept.  To think that maybe only 8 years ago I was doing everything in Microsoft Access and Excel.  Gone are those days; Microsoft Excel is now just my Notepad.

One thing I do not have yet is the ability in KNIME to upload data back into BigQuery.  Perhaps that will be a capability soon, but BigQuery is all flush-and-fill, and KNIME seems to need a batch ability.  So, long story short: I wanted to do Principal Component Analysis on a dataset, transforming two variables into their principal components, but have it scheduled in Google BigQuery so it automatically refreshes when new data arrives.  So I reduced the problem to one totally solvable in BQ SQL.  I will not go into depth about why you need PCA, what the covariance matrix tells us about the data, or how to interpret the eigenvalues or loadings.  That horse has been beaten with a stick by people far more capable than me.  This is purely a discrete approximation implemented in SQL.  I am indebted to Lindsay I. Smith, who wrote the paper here:

Step 0:  Creating the test dataset and finding correlations.

Using the natality public data in Google BigQuery, I have created a dataset with at least two columns I assume are very correlated: mother_age and father_age.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_table` as (
Select *
FROM `bigquery-public-data.samples.natality`
where year >= 2005
and father_age <> 99
--order by year desc
LIMIT 20000
)

Then I used Google BigQuery's CORR function to check the correlation between the columns.

Select
CORR(mother_age,father_age) as ParentAge_Correl
from `projectid.datasetid.zzz_example_table`
where father_age <> 99

… and as expected, those two columns are correlated: ParentAge_Correl = 0.78433544.  Highly correlated variables lead me to believe I will find a strong component from them.

Step 1:  Creating the Covariance Matrix.

Google BigQuery has done a great job on their statistical functions, and similar to the correlation function utilized above, I will first make a covariance matrix using the tools already available.  You may want to make this a table, as it will be used more than once; however, the dataset is so small and BigQuery so fast that I didn't see the need.

Select
round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22
from `projectid.datasetid.zzz_example_table`
where father_age <> 99

I am going to treat this as one row, with the suffix "_RowIdColumnId" on any fields that are part of a matrix, starting the index at one.  One could make an actual matrix out of this using a union.  It produces this output:

ParentAge_Covar11 ParentAge_Covar12 ParentAge_Covar21 ParentAge_Covar22
39.26018 35.54529 35.54529 52.31282

In addition, I have rounded everything in this matrix to the 5th decimal place.  If your data needs more precision or is very small, you will need to adjust accordingly, but this is accurate enough for my needs.

Step 2: Creating a dataset of possible values to solve for the eigenvalues.

My personal familiarity with the subject matter leads me to believe that human fecundity in this technological age ranges from the teenage years for mothers to maybe the sixties for fathers, with the ages positively correlated; as a result I expect my eigenvalues to be between 0 and 100 years of age.  These are big assumptions, and I encourage you to run data you do not know well through KNIME's various modules first.  The code below will generate one million candidate values from 0.0000 to 99.9999, which is super easy in BigQuery.  What's a million rows these days?

with nbrs as (
Select 1 as nbr union all Select 0 union all Select 2 union all Select 3 union all
Select 4 union all Select 5 union all Select 6 union all Select 7 union all Select 8 union all Select 9
)
,valuez as (
Select round(tens.nbr*10.0+ones.nbr*1.0+tenths.nbr/10.00+ hundies.nbr/100.00 + thousies.nbr/1000.00 + tenthousies.nbr / 10000.00,5) as value
from nbrs tens
join nbrs ones on 1=1
join nbrs tenths on 1=1
join nbrs hundies on 1=1
join nbrs thousies on 1=1
join nbrs tenthousies on 1=1
)
Select * from valuez
order by 1 desc

If your eigenvalues fall outside the range of this specific example, adjust accordingly.

Step 3:  Using the SQL where clause as a math solver.

Somebody on the internet is training people to use functions to create smart indices on SQL databases: he saved the left-hand side of a function as an index and used the where clause as a solver.  That's pretty bad-ass if you ask me, and I will attempt to utilize the same method.  Now I am going to combine all the previous steps as common table expressions (CTEs) and then solve for the eigenvalues: I select the long list of candidate values I'm certain the eigenvalues fall within, Cartesian-join the covariance matrix, allow a rather loose threshold of being between -0.0500 and 0.0500 in the where clause, and finally group the possible answers, order them by accuracy, and keep the most accurate ones.  Note! Depending on the accuracy you need and the rounding you select, you may get back zero, one, or two eigenvalues.  Play around with the rounding and adjust as needed.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_eigenvalues` as (
with nbrs as (
Select 1 as nbr union all Select 0 union all Select 2 union all Select 3 union all
Select 4 union all Select 5 union all Select 6 union all Select 7 union all Select 8 union all Select 9
)
,valuez as (
Select round(tens.nbr*10.0+ones.nbr*1.0+tenths.nbr/10.00+ hundies.nbr/100.00 + thousies.nbr/1000.00 + tenthousies.nbr / 10000.00,5) as value
from nbrs tens
join nbrs ones on 1=1
join nbrs tenths on 1=1
join nbrs hundies on 1=1
join nbrs thousies on 1=1
join nbrs tenthousies on 1=1
)
,CovarMatrix as (
Select
round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22
from `projectid.datasetid.zzz_example_table`
where father_age <> 99
)
Select *
,DENSE_RANK() over (order by EigenValue desc) as EigenValueID
from (
Select *
,dense_rank() over (order by abs(Accuracy) asc) as AccuracyValueRank
from (
Select *
,round(POW(EigenValue,2)-Lambda1*EigenValue+integervalue,5) as Accuracy
from (
Select
valuez.value as EigenValue
,ParentAge_Covar11*ParentAge_Covar22-ParentAge_Covar21*ParentAge_Covar12 as integervalue
,ParentAge_Covar11+ParentAge_Covar22 as Lambda1
from valuez,CovarMatrix
-- union all --not sure if negatives are needed. Distant math memories seem to tell me yes, eigenvalues can be negative
-- Select
-- -1*valuez.value
-- ,ParentAge_Covar11*ParentAge_Covar22-ParentAge_Covar21*ParentAge_Covar12 as integervalue
-- ,ParentAge_Covar11+ParentAge_Covar22 as Lambda1
-- from valuez,CovarMatrix
) a
where round(POW(EigenValue,2)-Lambda1*EigenValue+integervalue,5) between -0.0500 and 0.0500
group by 1,2,3,4
--order by 4 desc
) a
) aa
where AccuracyValueRank = 1
)

I am grouping the values, finding the numbers that get me the closest to zero and then selecting those two.


If you run the above code in Google BigQuery, you will get these results:

Row EigenValue AccuracyValueRank EigenValueID
1 81.926 1 1
2 9.647 1 2

How close were those eigenvalues compared to the real ones?

Row KNIME Eigenvalue SQL Eigenvalue % Diff
1 81.930053 81.926 0.0049%
2 9.6475241 9.647 0.0054%

The KNIME eigenvalue came from the PCA compute module, and a quick check confirms they are very close to each other.
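For a 2x2 covariance matrix, the grid search can also be bypassed entirely: the characteristic polynomial λ² − trace·λ + determinant = 0 is just a quadratic, so the quadratic formula gives both eigenvalues exactly. A sketch using the covariance values from the output above:

```python
import math

# Covariance entries from the SQL output above (Covar12 == Covar21).
c11, c12, c22 = 39.26018, 35.54529, 52.31282

# Characteristic polynomial of a 2x2 symmetric matrix:
#   lambda^2 - (c11 + c22)*lambda + (c11*c22 - c12^2) = 0
tr = c11 + c22            # "Lambda1" in the SQL
det = c11 * c22 - c12 ** 2  # "integervalue" in the SQL
disc = math.sqrt(tr ** 2 - 4 * det)
l1, l2 = (tr + disc) / 2, (tr - disc) / 2
print(round(l1, 3), round(l2, 3))  # matches the grid-search 81.926 and 9.647
```

The closed form needs no million-row candidate table and has no trouble with closely spaced or negative eigenvalues, which is exactly where the where-clause solver gets fragile.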


Using the where clause as a solver is not the most ideal solution, but it works well enough for this example.  A future improvement would be to use the actual closed-form factorization to find the eigenvalues accurately and without the million rows.  I'm not going to sugar-coat this: if the eigenvalues are very close together, the above method may not work.  I used two columns I knew were highly correlated, where likely father's age > mother's age.  If your eigenvalues can be negative, I believe you'll have to uncomment the negative union all valuez part to find them.

Next week I will calculate the rest of the matrix you see in the KNIME image above, take the original data, and transform it into the new principal component dimensions.  Then I will use the Corr() function in BigQuery to check that the new PCA components are not correlated.

Google BigQuery Connector in KNIME

KNIME.com posted a method for connecting a KNIME workflow to Google BigQuery:  https://www.knime.com/blog/GoogleBigQuery_meets_SQLite

This article blended SQLite and BigQuery datasources.  In my life, KNIME has been a combination of a large roll of duct tape and baling wire, coupled with a Leatherman and a Swiss Army knife.  It is a tool to connect anything to anything else, keep all that embedded logic, and enable analytics across any dataset with any tool.  The BigQuery connector did not quite work out-of-the-box for me as they had it set up in their article.  I took their workflow, simplified a few steps, and was able to get BigQuery to connect in a KNIME workflow!

Here are the detailed instructions I used to connect.

  1.  Download and install the SIMBA JDBC driver from this website.  I used version 4.2 on my machine:  https://cloud.google.com/bigquery/partners/simba-drivers/
  2.  Unzip this into your KNIME folder.
  3. Point KNIME to the directory shown above.  Click on File > Preferences then KNIME > Databases and add the directory
  4. Copy the first two modules in their KNIME example workflow called 10_GoogleBigQuery…  in the examples server.  Search for “bigquery” in the KNIME Explorer and it will be there.  Open that workflow and then save it to your workspace.

    Remove everything but the first two modules at the top left.  Then add the default JDBC “Database Reader” module next to the Database Connector and connect them.  Also read the KNIME.com link I posted above; they state that using the “Credentials Input” module encrypts the password data.
  5. Reconfigure the middle module “Connect to Bigquery DB…” with these settings

    The Database Driver dropdown should now list com.simba.googlebigquery.jdbc42.Driver.  Replace the database URL with this information, replacing “xxxxxx” with your billing project.

  6. Click on that same module “Database Connector” and press play.  It will give you a popup with a URL.  Copy and paste that URL into a new web browser (I had to use a browser with only one Google sign-in).  You will then get a screen like this with a special code

    Copy and paste that code xxxxxxx-xxxxxx back into the KNIME dialog box.
    Updated Note:  The step above *must* be done quickly, as the module may time out before you finish entering your credentials.
  7. You should now get a green finished module.  In the subsequent Database Reader module, enter this standard SQL query (not legacy SQL) and run it.  It is a super-simple pull of only 10 rows from a public BigQuery dataset about taxi cab pickups.
    SELECT pickup_longitude, pickup_latitude 
    FROM `nyc-tlc.green.trips_2014` LIMIT 10
  8. Tada!  Data from BQ into KNIME.  Enjoy and make the magic happen!  Data to the people!

Adjustable color Saturation in Tableau

Have you ever wanted to give your users more choices in coloring a heatmap?  Currently Tableau does not allow the viewer of a visualization to adjust its colors; the user must have the web-edit ability or Tableau Desktop to make major changes to the color scales.  Although with a few tricks you could adjust a dashboard based on a parameter and show different views, I wanted to give my users more control without those methods.  Specifically, I wanted a heat map where the user can control how much saturation the extreme values have and how much those values visually ‘overpower’ the lesser values.  Conversely, a user could ‘empower’ the lower values and equalize them with the outliers.

The first thing I did was make a list parameter, populate it with some numeric values, and then give them more user-friendly “Display As” values.   I did this to allow the user several options without offloading the entire decision-making process to them.  This also keeps the view consistent over many viewers or iterations, and relieves the user of too many choices, where the colors could go wildly wrong.  Next I added the parameter to the worksheet and made it a slider-type parameter.  Tableau’s parameters have several options, but to me the ‘slider’ makes sense in situations where there are a few concrete options.
Color Desat Parameter v2

Next I fed this parameter into a new formula that I called “Colorizer”:

[Colorizer] : log([Mag])^[Color Desaturator]

I’ve calculated out the possibilities below in Excel.  (Yes, although you *can* do such a chart in Tableau, Excel is still good at what it does best.)


As you can see in the table on the right side, the difference between the Min and Max values (diff) varies wildly depending on the parameter selected.  For this specific dataset of earthquake magnitudes the above formula works well; however, for your specific data you may need a different base using this formula:

[Colorizer] : log([Mag],[base > 10])^[Color Desaturator]
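A quick sketch of what the exponent does, using hypothetical magnitudes in the post’s 6.0–9.6 range (Tableau’s log() is base 10, like math.log10):

```python
import math

mags = [6.0, 7.0, 8.0, 9.6]        # hypothetical magnitudes in the 6.0-9.6 range
spreads = []
for p in (0.5, 1, 2, 4):           # candidate "Color Desaturator" values
    vals = [math.log10(m) ** p for m in mags]   # the [Colorizer] formula
    spreads.append(max(vals) - min(vals))       # the "diff" from the Excel table
```

Larger exponents widen the gap between the top and bottom values, letting the outliers dominate the color range; smaller exponents compress the gap and equalize the lesser values.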

As you can see below, the colors wash out tremendously depending on the parameter selected.  This allows the user to focus on all numbers at the lowest setting, or only the outliers at the max setting.



The unfortunate side-effect of this is that the Tableau default color legend is now *incomprehensible*.  The dataset I used is on earthquakes, with magnitudes ranging from 6.0 to a maximum of 9.6.  However, look at the default legend, which I have now termed the “Useless Legend”:

The result is that you will need to create another legend with the same color encoding (using the [Colorizer] field) but with the actual value (in this case the [Mag] field) in the text description.  Here is an example:

Enjoy this simple trick by downloading and deconstructing the embedded view below.



Using R’s elasticsearchr library to connect KNIME to Elasticsearch.

I’d like to highlight another great suite of KNIME connectors.  These have been in KNIME for a long time, but thanks to the incredible number of libraries within R, they let me make KNIME do so much more.  I had a situation where I wanted to connect to ElasticSearch, extract JSON data, contort it into a data frame, and then export that into KNIME and eventually into Tableau.  Thankfully R has a lightweight ElasticSearch library called “elasticsearchr”.  I will utilize it in the R module to extract the data.


Since we are depending on R to generate the data, the first module we are going to use is called ‘R Source (Workspace)‘.


I am passing three ElasticSearch variables into this module (red line): server, index, and document.  This is done so I can easily pass this information on to other R modules for re-use.  Here is the R script I am using in this module to query the data and extract a simple count by day for an action.

Without an aggregation:

This should suffice for simple data pulls even among events that may have different json structures:

########### CONSTRUCT the Query ###############
qr1 <- query('{
 "bool": {
   "must": [
     { "terms": { } },
     { "range": {
         "data.xxxxxxxx": {
           "gte": "now-1d/d",
           "lt": "now+1d/d"
     } } }
   ]
} }')
####### GRAB THE DATA ###########
esdata <- elastic(knime.flow.in[["ES_Server"]],knime.flow.in[["MicroSvc"]],"clientaction") %search% (qr1)

The variable above, ‘esdata’, is a list; I handle extraction in a separate module called R to Table:


In it I convert the data into a KNIME table by addressing just the specific data elements I want:

knime.out <- data.frame(
 esdata$`data.xxxx.xxxxx.xxxxxx`
)

If one of those elements doesn’t exist in the specific JSON document, then KNIME will kindly null (the red ?) that field.  The first term is the “data.xxxx.xxxxx.xxxxxx” event referenced above.


This is the super-simple way to get the data.

With An Aggregation:

When the JSON structure is more complicated and you only wish to have an aggregation, such as a count by day, the method below enables that.

########### CONSTRUCT the Query ###############
qr1 <- query('{
 "term": {
   "data.xxxxxxxx": {
     "value": "xxxxx"
} } }')
########### CONSTRUCT the AGG ###############
ag1 <- aggs('{
 "actions": {
   "terms": {
     "field": "data.xxxxxx", "size": 10000
   },
   "aggs": {
     "data.xxxxxxx": {
       "date_histogram": {
         "field": "data.xxxxxxx",
         "interval": "day"
} } } } }')
## Debugging
#print(qr1 + ag1)

esdata2 <- elastic(knime.flow.in[["ES_Server"]]
,knime.flow.in[["DocumentName"]]) %search% (qr1 + ag1)

The aggregation is not required, and a straight ElasticSearch query can be executed; however, if your JSON structure frequently changes, it may be difficult to extract the information.

Extraction into KNIME again uses that “R to Table” module but with very different code.  Note that I am open to suggestions about this section, as this R code might be very inefficient.

### get the length for the loop ###
maxi <- length(esdata2$key)

### make a list to contain the looped results ###
datalist = list()

### loop through, constructing a data frame at each point and pass that into a list ###
for (i in 1:maxi) {
 df1 <- as.data.frame(esdata2$data.xxxxxxx.buckets[[i]])
 dat <- data.frame(esdata2$key[i],esdata2$doc_count[i],df1$key_as_string,df1$doc_count)
 datalist[[i]] <- dat
}

### re-extract the list into a data frame ###
big_data = do.call(rbind,datalist)

## export back out to KNIME ##
knime.out <- as.data.frame(big_data)
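For comparison, the same flatten-nested-buckets step can be written as a single comprehension.  Here is a Python sketch of the idea; the field names and values are hypothetical, mirroring the shape of the redacted terms + date_histogram response above:

```python
# Hypothetical shape of an ES terms + date_histogram aggregation response
agg_buckets = [
    {"key": "actionA", "doc_count": 5,
     "by_day": {"buckets": [
         {"key_as_string": "2017-04-24", "doc_count": 2},
         {"key_as_string": "2017-04-25", "doc_count": 3}]}},
    {"key": "actionB", "doc_count": 1,
     "by_day": {"buckets": [
         {"key_as_string": "2017-04-24", "doc_count": 1}]}},
]

# One row per (action, day) pair -- the equivalent of the R loop plus rbind
rows = [
    (a["key"], day["key_as_string"], day["doc_count"])
    for a in agg_buckets
    for day in a["by_day"]["buckets"]
]
```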


I have scheduled the above workflow to execute automatically every morning, around the time I get to work.


Three frequently used Tableau date formulas.

I’m blogging about these formulas for two reasons: half to educate the world, and half for my own personal use, to have all these formulas in one place.

First, I often want to know the Monday of a given date.  This is easily accomplished using the DATETRUNC function in Tableau.  Others have pointed this out many times, and Tableau defaults to this, so I don’t take any credit for it.

[WeekStart]: DATETRUNC('week', [Date of Event])

However, if we tweak that and use the Tableau function Now(), we can get the start of the current week in your local time.

[WeekOfNow]: DATETRUNC('week',now())

This of course isn’t that ground-breaking either.  However, let’s take the above two formulas and do a date difference between them.

[WeeksAgo]: DATEDIFF('week',[WeekStart],[WeekOfNow])

Now we can use this ‘WeeksAgo’ field to always exclude the current week from the view (Tableau’s default relative date filters include the current week).  These formulas anchor our visual to last week’s numbers while updating automatically, without manual intervention.  We can then build a flowing calculation with a simple case statement that includes only the last 4 weeks, excluding the present week.  As I am writing this on 4-26-2017, here is what the view looks like with a simple rolling 4-week include/exclude filter; it will automatically update this report every Monday morning.
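The same week logic, sketched in Python with hypothetical dates (Monday-based weeks, matching Tableau’s default):

```python
from datetime import date, timedelta

def week_start(d: date) -> date:
    # DATETRUNC('week', d): the Monday of d's week
    return d - timedelta(days=d.weekday())

def weeks_ago(d: date, today: date) -> int:
    # DATEDIFF('week', [WeekStart], [WeekOfNow])
    return (week_start(today) - week_start(d)).days // 7

def in_rolling_four_weeks(d: date, today: date) -> bool:
    # The last 4 *complete* weeks, excluding the current partial week
    return 1 <= weeks_ago(d, today) <= 4
```

With today = 2017-04-26 (a Wednesday), any date in the current partial week gets WeeksAgo = 0 and is excluded, while the four prior Monday-to-Sunday weeks pass the filter.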



Just to further clarify: if you had used Tableau’s relative date filter for the last 4 weeks, it would include the current week (which is a partial week) and only three full weeks:


Further Application:

Ever have a line chart based on live data, with that ‘dangling line’ or ‘half-bar’ affecting the final view?  The above DATEDIFF can be applied with Today() and ‘day’ to remove the current day’s values from a trend.

Useful KNIME connector: PMML to SQL

I’d like to take this time to highlight a very useful KNIME module called PMML to SQL.


This lovely module allows you to transform simple data-mining algorithms into an actual working SQL script.  PMML stands for “Predictive Model Markup Language”, a format for interchanging data-mining models between systems.  KNIME offers us the ability to translate it directly into SQL.


For this example I am using the wine dataset and doing an easy classification of whether a wine is red or white based on the physicochemical variables.  After loading both files into a SQL table called zzz_test_data (I used KNIME to do that), I also needed to create an identity column, since a unique identity column is required by the PMML to SQL module.  Here is a simple method for doing that on my table:

alter table db.dbo.zzz_test_data add ident INT IDENTITY

After the above steps I then built this simple workflow to predict the color.


The logistic regression learner model above had these settings:


In addition, it may be helpful to create a view or rename columns to remove any spaces, as the PMML to SQL module will automatically place an underscore in the name.  The module requires that all the variables you want to utilize exist in the SQL environment, so there may be additional technical debt in stored procedures or ETL to perform certain transformations before you can apply the model.  In this example I built an interaction variable called ‘FreeSulfPerc‘ out of the two variables [free sulfur dioxide] / [total sulfur dioxide] in SQL.  Combining variables within the SQL does have the added benefit of making the resulting PMML-to-SQL calculation smaller.  I also changed the output from the model from citric_acid to [citric acid] as citric_acid

The output of the PMML-to-SQL module thus looks like this:

SELECT p.*, (
CASE WHEN PRed>=PWhite THEN 'Red' WHEN PWhite>=PRed THEN 'White' END) AS Color 
FROM (SELECT ident, s_Red / ((s_Red) + (s_White)) AS PRed, s_White / ((s_Red) + (s_White)) AS PWhite 
FROM (SELECT ident, EXP((-440.57444453231847 * p0 + -4.407701802582567 * p1 + 56.558477040792845 * p2 + 414.9181775692244 * p3 + 5.205643006902104 * p4 + 8.115457264590452 * p5 + 8.539945396713552 * p6)) AS s_Red, EXP(0.0) AS s_White 
FROM (SELECT ident, 1.0 AS p0, POWER(citric_acid, 1) AS p1, POWER(chlorides, 1) AS p2, POWER(density, 1) AS p3, POWER(pH, 1) AS p4, POWER(sulphates, 1) AS p5, POWER(FreeSulfPerc, 1) AS p6 
FROM (SELECT ident, [citric acid] AS citric_acid, chlorides AS chlorides, density AS density, pH AS pH, sulphates AS sulphates, [free sulfur dioxide]/[total sulfur dioxide] AS FreeSulfPerc 
FROM zzz_test_data AS ms_input) AS local_trans) AS x) AS s) AS p
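The generated SQL is just a binomial logistic model evaluated inline: a linear score for Red is exponentiated, then normalized against a reference score of EXP(0) for White.  A sketch of the same math with the coefficients from the query above and a hypothetical feature row (the feature values are made up, not from the wine data):

```python
import math

# Coefficients copied from the generated SQL above (p0 is the intercept term)
coef = [-440.57444453231847, -4.407701802582567, 56.558477040792845,
        414.9181775692244, 5.205643006902104, 8.115457264590452,
        8.539945396713552]

# Hypothetical row: 1.0, citric_acid, chlorides, density, pH, sulphates, FreeSulfPerc
row = [1.0, 0.30, 0.045, 0.9946, 3.3, 0.65, 0.25]

s_red = math.exp(sum(c * x for c, x in zip(coef, row)))   # s_Red in the SQL
s_white = math.exp(0.0)                                   # s_White, the reference class
p_red = s_red / (s_red + s_white)                         # PRed
p_white = s_white / (s_red + s_white)                     # PWhite
color = "Red" if p_red >= p_white else "White"            # the CASE expression
```

The two probabilities always sum to one, and the CASE expression simply picks the larger, exactly as the nested SELECTs do.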

Now we can take this exact SQL and combine it with our original SQL table and run it.

select a.*,b.PRed,b.PWhite,b.Color as PColor
from [dbo].[zzz_test_data] a
left join (
--model output from PMML to SQL from KNIME
 SELECT p.*, (
 CASE WHEN PRed>=PWhite THEN 'Red' WHEN PWhite>=PRed THEN 'White' END) AS Color 
 FROM (SELECT ident, s_Red / ((s_Red) + (s_White)) AS PRed, s_White / ((s_Red) + (s_White)) AS PWhite 
 FROM (SELECT ident, EXP((-440.57444453231847 * p0 + -4.407701802582567 * p1 + 56.558477040792845 * p2 + 414.9181775692244 * p3 + 5.205643006902104 * p4 + 8.115457264590452 * p5 + 8.539945396713552 * p6)) AS s_Red, EXP(0.0) AS s_White 
 FROM (SELECT ident, 1.0 AS p0, POWER(citric_acid, 1) AS p1, POWER(chlorides, 1) AS p2, POWER(density, 1) AS p3, POWER(pH, 1) AS p4, POWER(sulphates, 1) AS p5, POWER(FreeSulfPerc, 1) AS p6 
 FROM (SELECT ident, [citric acid] AS citric_acid, chlorides AS chlorides, density AS density, pH AS pH, sulphates AS sulphates, [free sulfur dioxide]/[total sulfur dioxide] AS FreeSulfPerc 
 FROM zzz_test_data AS ms_input) AS local_trans) AS x) AS s) AS p
) b on a.ident = b.ident


Here is the final output translated fully into SQL:


You can imagine using this SQL to create a view that feeds a Tableau dataset updating daily or in real time.  A senior analyst can then create and embed a simple data-mining classification for consumption by junior analysts munging the data, financial analysts doing a forecast, or operations teams predicting future usage.


Good Office Spaces for Data Collaboration.

This is going to be a sharp departure from my usual technical how-to guides.  In the past two years I’ve experienced several waves of changes to my work space that have been completely positive from the perspective of sharing data with people.  Tableau has been remarkable in allowing me to analyze data at the speed of thought: not only does it let me try things as I think of them, it lets me quickly translate what other people are thinking and show it to them immediately.  At times, though, the audience was limited to the people who could fit in my cube or who took the time to interact with my workbooks on the server.  Aside from the limited features of story points, it’s relatively hard to keep all viewers on the same guided path with an interactive workbook; the default view in Tableau won’t even tell you who viewed the workbook.  In short, there are not many ways to let all viewers travel along the same path synchronously w/o a formal meeting time and location.

Change #1:  The first change I experienced was moving from a high-wall cube to a low-wall one.  You’ve probably already realized this yourself: it’s hard to fit more than two people in a cubicle without feeling very cramped.  But with low cube walls, the audience doesn’t need to crowd into the doorway of a single cube to watch the analyst; they can be around without being on top of you.  It also helps with the off-hand “hey, check this out, neighbor” or “how do you do X in Y” conversations.

Let me turn my monitor around and show you!

Change #2:  Then I moved positions to a work environment with stand-up desks.  This allowed even more people to gather w/o feeling cramped: I could ditch my chair, and people didn’t have to hunch over or lean to see the monitor.  I could stand with four people shoulder-to-shoulder, all seeing the screen at once, walk everyone through the analysis, and let them all pitch questions and engage in deep dives quickly.

Everyone gather around for this!

Change #3: Very recently we had an Epson short-throw interactive projector installed in our room.   I believe this is the same model we have in our office (the non-interactive versions are about half the price).

Short Throw Projector = Walk Right up & Touch Your Data

This is installed in our normal walk space near my desk.  Placing it in semi-public space has a twofold benefit: (a) we do not have to spend square footage on a dedicated meeting space, and (b) we don’t fall victim to the curse of an always-booked conference room.  Since it’s short-throw, people can come and go without disturbing the image.  Our particular model enables users to interact with the image on the wall, so at least six people can stand around and all reach out and interact with a Tableau dashboard.  The interactivity is limited (it doesn’t do right-clicks or hover-overs), but a participant can easily click a filter and change the analysis.

Conclusion:  I believe these changes have opened my mind to a collaborative data sense-making space where everyone is aboard the same train of thought when exploring the data.  This method is in some ways superior to publishing an interactive dashboard on Tableau Server.  First of all, people may not even click on your report or see your email; if they do, they will invariably make different choices in interaction, showing different results, so in the end everyone gets a different experience.  Tableau themselves recognized this problem and provided a method for guided analysis by adding Story Points in previous versions.

Important Note:  The situations I’ve described are strictly for collaborative ad-hoc exploratory data analysis.  This is NOT a collaborative design process where loads of people tell me what color that bar chart should be or how the dashboard should look.  I’ve already built the tools and we interact with them to collectively discover things.

Finished example: