Principal Component Analysis in SQL with Google BigQuery, Part 2: Finding the eigenvectors, matrix multiplication, and checking the results.

In the last post, Principal Component Analysis in SQL with Google BigQuery, Part 1, I used SQL to find the eigenvalues.  Now I will use the temp table I made there and demonstrate how to apply the transformation back to the original data.  If you didn't read that post, or you have your own eigenvalues and different data, you should still be able to apply this next part to your own data without reading it.  My data is stored in a temp table called projectid.datasetid.zzz_example_table and my eigenvalues are saved in a different table called projectid.datasetid.zzz_example_eigenvalues, both of which are just generic placeholders.  I am not going into any detail on why/how, as the math behind it can be derived from this excellent paper by Lindsay I Smith.

Step 1:  Finding the resulting eigenvectors and the length of each vector.

In the query below, I take the two eigenvalues found in the previous post and the covariance matrix, subtract each eigenvalue from the diagonal, set x2 equal to one, and solve for x1 for each eigenvalue.  Then I calculate the length of each vector.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_evectors` as (
with eval1 as (
Select max(EigenValue) as EigenValue1
from `projectid.datasetid.zzz_example_eigenvalues`
where EigenValueID = 1
)
,eval2 as (
Select min(EigenValue) as EigenValue2
from `projectid.datasetid.zzz_example_eigenvalues`
where EigenValueID = 2
)
,CovarMatrix as (
Select --this should be a saved table since we're using it twice, but whatever
round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22 
from `projectid.datasetid.zzz_example_table`
where father_age <> 99
)
Select *
,SQRT(POW(evector_11,2)+POW(evector_12,2)) as Vector1_Length
,SQRT(POW(evector_21,2)+POW(evector_22,2)) as Vector2_Length
from (
Select
-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue1) as evector_11
,1.00 as evector_12 --we are forcing this value to one and solving for the above using the first eigenvalue
,-1.00*(ParentAge_Covar12)/(ParentAge_Covar11 - EigenValue2) as evector_21
,1.00 as evector_22 --we are forcing this value to one and solving for the above using the second eigenvalue
from CovarMatrix,eval1,eval2
) a
)
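As a sanity check, the same solve can be sketched in plain Python, using the rounded covariance values and eigenvalues from Part 1 of this example (treat the literals as illustrative; they come from my run):

```python
import math

# Rounded covariance terms and eigenvalues from Part 1 of this example
c11, c12 = 39.26018, 35.54529   # ParentAge_Covar11, ParentAge_Covar12
eig1, eig2 = 81.926, 9.647      # EigenValue1, EigenValue2

# (c11 - lambda)*x1 + c12*x2 = 0 with x2 = 1  ->  x1 = -c12 / (c11 - lambda)
evector_11 = -1.0 * c12 / (c11 - eig1)
evector_21 = -1.0 * c12 / (c11 - eig2)

# Length of each (x1, 1) vector
vector1_length = math.sqrt(evector_11 ** 2 + 1.0 ** 2)
vector2_length = math.sqrt(evector_21 ** 2 + 1.0 ** 2)
```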

Next I will divide the eigenvectors by the vector lengths to re-scale them and get the final matrix, which we will then use to transform the original data via matrix multiplication.  Again, the paper above is invaluable for understanding this step.  These lengths are Vector1_Length and Vector2_Length in the SQL above.  Combining these in the query below will produce a decomposition matrix similar to what I would get out of KNIME.

Select * from (
Select
1 as RowID
,evector_11/Vector1_Length as Col_1
,evector_12/Vector1_Length as Col_2
from `projectid.datasetid.zzz_example_evectors`

union all

Select
2 as RowID
,evector_21/Vector2_Length as Col_1
,evector_22/Vector2_Length as Col_2
from `projectid.datasetid.zzz_example_evectors`
) order by 1 asc

RowID Col_1 Col_2
1 0.6400828 0.76830594
2 -0.7683052 0.64008368
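A quick Python check, using the rounded numbers from this example, confirms that dividing by the vector lengths reproduces the rows in the table above:

```python
import math

c11, c12 = 39.26018, 35.54529            # covariance terms from Part 1
for eig, expected in [(81.926, (0.6400828, 0.76830594)),
                      (9.647, (-0.7683052, 0.64008368))]:
    x1 = -c12 / (c11 - eig)              # x2 is forced to 1
    length = math.sqrt(x1 ** 2 + 1.0)
    row = (x1 / length, 1.0 / length)    # divide each component by the length
    # matches the table rows to roughly 4 decimal places
    assert abs(row[0] - expected[0]) < 1e-4 and abs(row[1] - expected[1]) < 1e-4
```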

Output from the KNIME PCA Compute module:


Now I will take the original table data, find the means for mother_age and father_age in the tbl_avg CTE, and then subtract those means from each row's values to produce the mean-adjusted ages.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_final_transform` as (
with tbl_avg as (
Select round(AVG(mother_age),6) as avg_mother_age
,round(AVG(father_age),6) as avg_father_age
from `projectid.datasetid.zzz_example_table`
)
,matrix as (
Select evector_11/Vector1_Length as Decomp_11
,evector_12/Vector1_Length as Decomp_12
,evector_21/Vector2_Length as Decomp_21
,evector_22/Vector2_Length as Decomp_22
from `projectid.datasetid.zzz_example_evectors`
)
Select *
,Decomp_11*mean_adjusted_mother_age+Decomp_12*mean_adjusted_father_age as PCA_Dim1
,Decomp_21*mean_adjusted_mother_age+Decomp_22*mean_adjusted_father_age as PCA_Dim2
from (
Select mother_age ,mother_age-avg_mother_age as mean_adjusted_mother_age
,father_age ,father_age-avg_father_age as mean_adjusted_father_age
,Decomp_11 ,Decomp_12
,Decomp_21 ,Decomp_22
from `projectid.datasetid.zzz_example_table` a, tbl_avg b, matrix m
) a
)
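The mean-centering and matrix multiply can be sketched in Python; the ages below are made up, and the Decomp values are the rounded decomposition matrix from earlier in the post:

```python
# Hypothetical (mother_age, father_age) rows for illustration only
rows = [(33, 35), (18, 21), (42, 45)]
avg_mother = sum(r[0] for r in rows) / len(rows)
avg_father = sum(r[1] for r in rows) / len(rows)

# Rounded decomposition matrix from the evectors table
d11, d12 = 0.6400828, 0.76830594
d21, d22 = -0.7683052, 0.64008368

transformed = []
for mother_age, father_age in rows:
    m = mother_age - avg_mother              # mean_adjusted_mother_age
    f = father_age - avg_father              # mean_adjusted_father_age
    transformed.append((d11 * m + d12 * f,   # PCA_Dim1
                        d21 * m + d22 * f))  # PCA_Dim2
```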

Finally I will run the correlation (CORR() in BigQuery) on the original data, and then on the new data, to verify that the resulting transformed dataset has zero correlation.

Select round(corr(mother_age,father_age),6) as Original_data_Correlation
,round(corr(PCA_Dim1,PCA_Dim2),6) as PCA_Transformation_Correlation
from `projectid.datasetid.zzz_example_final_transform`

Original_data_Correlation PCA_Transformation_Correlation
0.784335 2.00E-06

As you can see above, the original correlation was 0.78, but after this transformation the correlation between the two transformed dimensions is very close to zero.  That it is not exactly zero is probably due to the rounding and the approximate solution found by using the where clause as a solver.  In addition, I have visualized the results from the zzz_example_final_transform table in Tableau Public with a loose description of the two PCA dimensions: !/vizhome/NatalityDataPCA/PCAChanges
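For readers who want to verify the whole pipeline outside of SQL, here is an end-to-end sketch in plain Python on synthetic data (the age distributions are invented).  With exact eigenvalues from the quadratic formula instead of the grid, the transformed dimensions come out uncorrelated to within floating-point error:

```python
import math, random

random.seed(7)
mother = [random.gauss(27, 6) for _ in range(5000)]
father = [m + random.gauss(3, 4) for m in mother]   # correlated by construction

def covar_pop(xs, ys):
    # population covariance, same quantity as BigQuery's COVAR_POP
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

c11 = covar_pop(mother, mother)
c12 = covar_pop(mother, father)
c22 = covar_pop(father, father)

# exact eigenvalues of the 2x2 covariance matrix via the quadratic formula
trace, det = c11 + c22, c11 * c22 - c12 * c12
l1 = (trace + math.sqrt(trace ** 2 - 4 * det)) / 2
l2 = (trace - math.sqrt(trace ** 2 - 4 * det)) / 2

def unit_evector(lam):
    x1 = -c12 / (c11 - lam)          # x2 fixed to 1, as in the post
    n = math.sqrt(x1 * x1 + 1.0)
    return x1 / n, 1.0 / n

(d11, d12), (d21, d22) = unit_evector(l1), unit_evector(l2)
mm, mf = sum(mother) / len(mother), sum(father) / len(father)
dim1 = [d11 * (x - mm) + d12 * (y - mf) for x, y in zip(mother, father)]
dim2 = [d21 * (x - mm) + d22 * (y - mf) for x, y in zip(mother, father)]

def corr(xs, ys):
    return covar_pop(xs, ys) / math.sqrt(covar_pop(xs, xs) * covar_pop(ys, ys))
```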



Principal Component Analysis in SQL with Google BigQuery, Part 1: Finding the covariance matrix and eigenvalues.

Hello again, it's been a while since my last post, but I have not been idle.  The reality is that most of my efforts have been directed fully toward my company's needs and our move over to Google BigQuery, so I haven't developed many more capabilities that I can share publicly.  KNIME & Google BigQuery have been great at enabling me to move very fast with data and produce a few proof-of-concepts.  To think that maybe it was only 8 years ago I was doing everything in Microsoft Access and Excel.  Gone are those days.  Microsoft Excel is now just my Notepad.

One thing I do not have yet is the ability in KNIME to upload data back into BigQuery.  Perhaps that will be a capability soon, but BigQuery is all flush-and-fill, and KNIME seems to need a batch ability.  So, long story short: I wanted to do Principal Component Analysis on a dataset and transform two variables into their principal components, but have it scheduled in Google BigQuery so it will automatically refresh when new data arrives.  So I reduced the problem into one totally solvable in BQ SQL.  I will not go into depth about why you need to do PCA, what the covariance matrix tells us about the data, nor how to interpret the eigenvalues or loadings.  That horse has been beaten with a stick by people far more capable than me.  This is purely a discrete approximation implemented in SQL.  I am indebted to Lindsay I Smith, who wrote this paper here:

Step 0:  Creating the test dataset and finding correlations.

Using the natality public data in Google BigQuery, I have created a dataset with at least two columns I assume are very correlated: mother_age and father_age.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_table` as (
Select *
FROM `bigquery-public-data.samples.natality` 
where year >= 2005
and father_age <> 99
--order by year desc
LIMIT 20000
)

Then I used Google BigQuery's CORR function to see the correlations between columns.

Select CORR(mother_age,father_age) as ParentAge_Correl
from `projectid.datasetid.zzz_example_table`
where father_age <> 99

… and as expected, those two columns are correlated: ParentAge_Correl = 0.78433544.  Highly correlated variables lead me to believe I will find a strong component from them.

Step 1:  Creating the Covariance Matrix.

Google BigQuery has done a great job on their statistical functions, and similar to the correlation function utilized above, I will first make a covariance matrix using the tools already made available.  You may want to make this a table since it will be used more than once; however, the dataset is so small and BigQuery so fast that I didn't see the need.

Select round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22 
from `projectid.datasetid.zzz_example_table`
where father_age <> 99

I am going to treat this as one row, with the suffix "_RowIdColumnId" on any fields that are part of a matrix, starting the index at one.  One could make an actual matrix out of this using a union.  It produces this output:

ParentAge_Covar11 ParentAge_Covar12 ParentAge_Covar21 ParentAge_Covar22
39.26018 35.54529 35.54529 52.31282

In addition, I have rounded everything in this matrix to the 5th decimal place.  If your data needs more precision or the values are very small, you will need to adjust accordingly, but this is accurate enough for my needs.
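COVAR_POP is just the population covariance; a small Python sketch with made-up ages shows what each cell of the matrix holds, and why the two off-diagonal cells are identical:

```python
def covar_pop(xs, ys):
    # population covariance: average of the products of the mean deviations
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

mother_age = [22, 25, 30, 33, 40]      # made-up ages for illustration
father_age = [24, 28, 31, 36, 43]

matrix = [[covar_pop(mother_age, mother_age), covar_pop(father_age, mother_age)],
          [covar_pop(mother_age, father_age), covar_pop(father_age, father_age)]]
# matrix[0][1] == matrix[1][0]: the covariance matrix is symmetric
```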

Step 2: Creating a dataset of possible values to solve for the eigenvalues.

My personal familiarity with the subject matter leads me to believe that human fecundity in this current technological age ranges from the teenage years for the mother up to maybe the sixties for the father, and that the two ages are positively correlated; as a result, I expect my eigenvalues to be between 0 and 100 years of age.  These are very big assumptions, and I encourage you to run data you do not know very well through KNIME's various modules first.  The code below will generate 1 million rows from 0.0000 to 99.9999, which is super easy in BigQuery.  What's a million rows these days?

with nbrs as (
Select 1 as nbr union all Select 0 union all Select 2 union all Select 3 union all 
Select 4 union all Select 5 union all Select 6 union all Select 7 union all Select 8 union all Select 9
)
,valuez as (
Select round(tens.nbr*10.0+ones.nbr*1.0+tenths.nbr/10.00+ hundies.nbr/100.00 + thousies.nbr/1000.00 + tenthousies.nbr / 10000.00,5) as value
from nbrs tens
join nbrs ones on 1=1
join nbrs tenths on 1=1
join nbrs hundies on 1=1
join nbrs thousies on 1=1
join nbrs tenthousies on 1=1 
)
Select * from valuez
order by 1 desc

If your eigenvalues fall outside the range of this specific example, adjust accordingly.
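The same digit cross-join is easy to mimic in Python, which also confirms the row count and range of the candidate grid:

```python
from itertools import product

# Six base-10 digit positions crossed together: 10**6 candidate values
# from 0.0000 to 99.9999 in steps of 0.0001
digits = range(10)
valuez = [round(tens * 10 + ones + tenths / 10 + hundies / 100
                + thousies / 1000 + tenthousies / 10000, 5)
          for tens, ones, tenths, hundies, thousies, tenthousies
          in product(digits, repeat=6)]
```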

Step 3:  Using the SQL where clause as a math solver.

Somebody on the internet is training people to use functions to create smart indices on SQL databases.  That was pretty novel in my opinion: he saved the left-hand side of a function as an index and used the where clause as a solver.  That's pretty bad-ass if you ask me, and I will attempt to utilize that method.  Now I am going to combine all the previous steps as common table expressions (CTEs) and then solve for the eigenvalues: I select the long list of candidate values I'm certain the eigenvalues exist in, Cartesian-join the covariance matrix, allow a rather loose threshold of being between -0.0500 and 0.0500 in the where clause, and finally group those possible answers, order them by accuracy, and keep the most accurate ones.  Note!  Depending on the accuracy you need and the rounding you select, you may get back zero, one, or two eigenvalues.  Play around with the rounding and adjust as needed.

CREATE OR REPLACE TABLE `projectid.datasetid.zzz_example_eigenvalues` as (
with nbrs as (
Select 1 as nbr union all Select 0 union all Select 2 union all Select 3 union all 
Select 4 union all Select 5 union all Select 6 union all Select 7 union all Select 8 union all Select 9
)
,valuez as (
Select round(tens.nbr*10.0+ones.nbr*1.0+tenths.nbr/10.00+ hundies.nbr/100.00 + thousies.nbr/1000.00 + tenthousies.nbr / 10000.00,5) as value
from nbrs tens
join nbrs ones on 1=1
join nbrs tenths on 1=1
join nbrs hundies on 1=1
join nbrs thousies on 1=1
join nbrs tenthousies on 1=1 
)
,CovarMatrix as (
Select round(COVAR_POP(mother_age,mother_age),5) as ParentAge_Covar11
,round(COVAR_POP(father_age,mother_age),5) as ParentAge_Covar12
,round(COVAR_POP(mother_age,father_age),5) as ParentAge_Covar21
,round(COVAR_POP(father_age,father_age),5) as ParentAge_Covar22 
from `projectid.datasetid.zzz_example_table`
where father_age <> 99
)
Select *
,DENSE_RANK() over (order by EigenValue desc) as EigenValueID
from (
Select *
,dense_rank() over (order by abs(Accuracy) asc) as AccuracyValueRank
from (
Select EigenValue, integervalue, Lambda1
,round(POW(EigenValue,2)-Lambda1*EigenValue+integervalue,5) as Accuracy
from (
Select valuez.value as EigenValue
,ParentAge_Covar11*ParentAge_Covar22-ParentAge_Covar21*ParentAge_Covar12 as integervalue
,ParentAge_Covar11+ParentAge_Covar22 as Lambda1
from valuez,CovarMatrix
-- union all --not sure if negatives are needed. Distant math memories seem to tell me yes, eigenvalues can be negative
-- Select 
-- -1*valuez.value 
-- ,ParentAge_Covar11*ParentAge_Covar22-ParentAge_Covar21*ParentAge_Covar12 as integervalue
-- ,ParentAge_Covar11+ParentAge_Covar22 as Lambda1
-- from valuez,CovarMatrix
) a
where round(POW(EigenValue,2)-Lambda1*EigenValue+integervalue,5) between -0.0500 and 0.0500
group by 1,2,3,4
--order by 4 desc
) a
) aa
where AccuracyValueRank = 1
)

I am grouping the candidate values, finding the numbers that get closest to zero, and then selecting those.


If you run the above code in Google BigQuery, you will get these results:

Row EigenValue AccuracyValueRank EigenValueID
1 81.926 1 1
2 9.647 1 2

How close were those eigenvalues compared to the real ones?

Row KNIME Eigenvalue SQL Eigenvalue % Diff
1 81.930053 81.926 0.0049%
2 9.6475241 9.647 0.0054%

The KNIME eigenvalue came from the PCA compute module, and a quick check confirms they are very close to each other.


Using the where clause as a solver is not the most ideal solution, but it worked well enough for this example.  A future improvement is to use an actual factorization formula to find the eigenvalues accurately and without the million rows.  I'm not going to sugar-coat this: if the eigenvalues are very close together, the above method may not work.  I used two columns I knew were very correlated, where likely father's age > mother's age.  If your eigenvalues can be negative, I believe you'll have to uncomment the negative union all valuez part to find them.
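For a 2x2 matrix, that factorization formula already exists in closed form: the characteristic polynomial is lambda^2 - trace*lambda + det = 0, which the quadratic formula solves exactly.  Here is a sketch using this example's rounded covariance values, which lands on the same two eigenvalues the grid search found:

```python
import math

c11, c12, c22 = 39.26018, 35.54529, 52.31282   # covariance matrix from Step 1
trace = c11 + c22                              # "Lambda1" in the SQL above
det = c11 * c22 - c12 * c12                    # "integervalue" in the SQL above

disc = math.sqrt(trace ** 2 - 4 * det)
eig1 = (trace + disc) / 2                      # matches the grid's 81.926
eig2 = (trace - disc) / 2                      # matches the grid's 9.647
```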

Next week I will calculate the rest of the matrix you see in the KNIME image above, take the original data, and transform it into the new principal component dimensions.  Then I will use the Corr() function in BigQuery to check and make sure the new PCA components are not correlated.

Google BigQuery Connector in KNIME

KNIME.Com posted a method for connecting a KNIME workflow to Google BigQuery:

This article blended SQLite and BigQuery datasources.  In my life, KNIME has been a combination of a large roll of duct-tape and bailing wire, coupled with a Leatherman & Swiss Army Knife: a tool to connect anything to anything else, keep all that embedded logic, and enable analytics across any dataset with any tool.  The BigQuery connector did not quite work out-of-the-box for me as they had it set up in their article, so I took their workflow, simplified a few steps, and was able to get BigQuery to connect in a KNIME workflow!

Here are the detailed instructions I used to connect.

  1.  Download and install the SIMBA JDBC driver from this website.  I used version 4.2 on my machine:
  2.  Unzip this into your KNIME folder.
  3. Point KNIME to the directory shown above.  Click on File > Preferences then KNIME > Databases and add the directory
  4. Copy the first two modules in their KNIME example workflow called 10_GoogleBigQuery…  in the examples server.  Search for “bigquery” in the KNIME Explorer and it will be there.  Open that workflow and then save it to your workspace.

    Remove everything but the first two modules at the top left.  Then add the default JDBC "Database Reader" module next to the Database Connector and connect them.  Also read the link I posted above; they state that using the "Credentials Input" module encrypts the password data.
  5. Reconfigure the middle module “Connect to Bigquery DB…”with these settings

    The Database Driver dropdown should now have com.simba.googlebigquery.jdbc42.Driver.  Replace the database URL with this information, replacing "xxxxxx" with your billing project.

  6. Click on that same module “Database Connector” and press play.  It will give you a popup with a URL.  Copy and Paste that URL into a new web browser (I had to use a browser with only one Google sign-in).  Then you will get a screen like this with a special code

    Copy and paste that code xxxxxxx-xxxxxx back into the KNIME dialog box.
    Updated Note:  The step above *must* be done quickly, as the module may time out before you finish entering your credentials.
  7. You should now get a green finished module.  In the subsequent Database Reader module, enter this standard SQL query (not legacy) and run it.  This is a super-simple pull of only 10 lines from a public dataset on BigQuery about taxi cab pickups.
    SELECT pickup_longitude, pickup_latitude 
    FROM `` LIMIT 10
  8. Tada!  Data from BQ into KNIME.  Enjoy and make the magic happen!  Data to the people!

Adjustable color Saturation in Tableau

Have you ever wanted to give your user more choices in coloring a heatmap?  Currently Tableau does not allow the viewer of a visualization to adjust its colors; the user must have the web-edit ability or Tableau Desktop to make major changes to the color scales.  Although with a few tricks you could adjust a dashboard based on a parameter and show different views, I wanted to give my users more control without those methods.  Specifically, I wanted a heat-map where the user can control how much saturation the extreme values have and how much those values visually 'overpower' the lesser values.  Conversely, a user could 'empower' the lower values and equalize them with the outliers.

The first thing I did was make a list parameter, populate it with some numeric values, and then give it more user-friendly "Display as" values.  The reason I did this is to allow the user several options without offloading the entire decision-making process to them.  This will also keep the view consistent over many viewers or iterations and relieve the user of too many choices; otherwise the colors could go wildly wrong.  Next I added the parameter to the worksheet and made it a slider-type parameter.  Tableau's parameters have several options, but to me the 'slider' makes sense in situations where there are a few concrete options.

Next I fed this parameter into a new formula that I called "Colorizer":

[Colorizer] : log([Mag])^[Color Desaturator]

I’ve calculated out the possibilities below in Excel.  (Yes although you *can* do such a chart in Tableau, Excel still is good at what it does best.)


As you can see in the table on the right side, the difference between the min and max values (diff) varies wildly depending on the parameter selected.  For this specific dataset, based on earthquake magnitudes, the above formula works well; however, for your specific data you may need to add a different base using this formula:

[Colorizer] : log([Mag],[base > 10])^[Color Desaturator]
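To see why the parameter changes the saturation, here is a small Python sketch (Tableau's log() defaults to base 10); the magnitude range is from this example, and the exponent settings are arbitrary stand-ins for the "Color Desaturator" parameter:

```python
import math

mags = [6.0, 7.0, 8.0, 9.6]        # example earthquake magnitudes
spreads = []
for k in (1, 3, 6):                # hypothetical "Color Desaturator" settings
    colorized = [math.log10(m) ** k for m in mags]
    spreads.append(max(colorized) - min(colorized))
# The min-to-max spread widens as k grows, so higher settings let the
# extreme values visually overpower the mid-range ones.
```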

As you can see below, the colors wash out tremendously depending on the parameter selected.  This allows the user to focus on all numbers at the lowest setting, or only the outliers at the max setting.



The unfortunate side-effect of this is that the Tableau default color legend is now *incomprehensible*.  The dataset I used is on earthquakes, which range from magnitude 6.0 to a maximum of 9.6.  However, look at the default legend, which I have now termed the "Useless Legend":

The result is that you will need to create another legend with the same color value (using the [Colorizer] field) but with the actual value (in this case the [Mag] field) in the text description.  Here is an example:

Enjoy this simple trick by downloading and deconstructing the embedded view below.


Using R's elasticsearchr library to connect KNIME to Elasticsearch.

I'd like to highlight another great suite of KNIME connectors.  These have been in KNIME for a long time, but thanks to the incredible number of libraries within R, I can make KNIME do so much more.  I had a situation where I wanted to connect to Elasticsearch, extract JSON data, contort it into a data frame, and then export that into KNIME and eventually into Tableau.  Thankfully R has a lightweight Elasticsearch library called "elasticsearchr".  I will utilize this in the R module to extract the data.


Since we are depending on R to generate the data, the first module we are going to use is called ‘R Source (Workspace)‘.


I am passing three Elasticsearch variables into this module (red line): server, index, and document.  This is done so I can easily pass this information on to other R modules for re-use.  Here is the R script I am using in this module to query the data and extract a simple count by day for an action.

Without an aggregation:

This should suffice for simple data pulls even among events that may have different json structures:

########### CONSTRUCT the Query ###############
qr1 <- query('{
 "bool": {
 "must": [
 { "terms": {
 } },
 { "range": {
 "data.xxxxxxxx": {
 "gte": "now-1d/d",
 "lt": "now+1d/d"
 } } }
 ] } }')
####### GRAB THE DATA ###########
esdata <- elastic(knime.flow.in[["ES_Server"]], knime.flow.in[["MicroSvc"]], "clientaction") %search% (qr1)

The variable 'esdata' above is a list, and I handle extraction in a separate module called R to Table:


In it, I convert the data into a KNIME table by just addressing the specific data elements I want.

knime.out <- data.frame(

If one of those elements doesn't exist in the specific JSON document, then KNIME will kindly null (the red ?) that field.  The first term is the "data.xxxx.xxxxx.xxxxxx" event referenced above.


This is the super-simple way to get the data.

With An Aggregation:

When the JSON structure is more complicated and you only wish to have an aggregation, such as a count by day, the method below enables that.

########### CONSTRUCT the Query ###############
qr1 <- query('{
 "term": {
 "data.xxxxxxxx": {
 "value": "xxxxx"
 } } }')
########### CONSTRUCT the AGG ###############
ag1 <- aggs('{
 "actions": {
 "terms": {
 "field": "data.xxxxxx","size": 10000
 },
 "aggs": {
 "data.xxxxxxx": {
 "date_histogram": {
 "field": "data.xxxxxxx",
 "interval": "day"
 } } } } }')
## Debugging
#print(qr1 + ag1)

esdata2 <- elastic(knime.flow.in[["ES_Server"]]
,knime.flow.in[["DocumentName"]]) %search% (qr1 + ag1)

An aggregation is not required, and a straight Elasticsearch query can be executed; however, if your JSON structure frequently changes, it may be difficult to extract the information without one.

Extraction into KNIME again uses that “R to Table” module but with very different code.  Note that I am open to suggestions about this section as this R code might be very inefficient.

### get the length for the loop ###
maxi <- length(esdata2$key)

### make a list to contain the looped results ###
datalist = list()

### loop through, constructing a data frame at each point, and pass that into a list ###
for (i in 1:maxi) {
 df1 <- as.data.frame(esdata2$data.xxxxxxx.buckets[[i]])
 dat <- data.frame(esdata2$key[i],esdata2$doc_count[i],df1$key_as_string,df1$doc_count)
 datalist[[i]] <- dat
}

### re-extract the list into a data frame ###
big_data = do.call(rbind,datalist)

## export back out to KNIME ##
knime.out <- data.frame(big_data)
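The flattening idea is easier to see in Python with a made-up response shaped like the Elasticsearch aggregation above (the keys and field names are placeholders): one output row per (outer bucket, day bucket) pair.

```python
# Hypothetical aggregation buckets, shaped like a terms-plus-date_histogram result
buckets = [
    {"key": "actionA", "doc_count": 12,
     "days": [{"key_as_string": "2017-04-24", "doc_count": 5},
              {"key_as_string": "2017-04-25", "doc_count": 7}]},
    {"key": "actionB", "doc_count": 3,
     "days": [{"key_as_string": "2017-04-24", "doc_count": 3}]},
]

# flatten: one row per (outer bucket, inner day bucket)
rows = [(b["key"], b["doc_count"], d["key_as_string"], d["doc_count"])
        for b in buckets for d in b["days"]]
```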


I have utilized the above workflow and scheduled it to execute automatically every morning about the time I get to work.


Three frequently used Tableau date formulas.

I'm blogging about these formulas for two reasons: half to educate the world, and half for my own personal use, to have all these formulas in one place.

First, I often want to know the Monday of a single date.  This is easily accomplished using a Tableau function called DATETRUNC.  Others have pointed this out many times, and Tableau defaults to this behavior, so I don't take any credit for it.

[WeekStart]: DATETRUNC('week', [Date of Event])

However, if we tweak that and use the Tableau function Now(), we can get the start of the current week for your local time.

[WeekOfNow]: DATETRUNC('week',now())

This of course isn't ground-breaking either.  However, let's take the above two formulas and do a date difference between them.

[WeeksAgo]: DATEDIFF('week',[WeekStart],[WeekOfNow])

Now we can use this 'WeeksAgo' field to always exclude the current week from the view; Tableau's default relative date filters include the current week.  These formulas anchor our visual to last week's number but update it automatically without manual intervention.  We can now have a flowing calculation, using a simple case statement constructed to include only the last 4 weeks and exclude the present week.  As I am writing this on 4-26-2017, here is what the view looks like with a simple rolling 4-week include/exclude filter that will automatically update this report every Monday morning.
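The three formulas map neatly onto Python's datetime module (a sketch assuming Monday-start weeks, as in the post; Tableau's week start can vary by data source):

```python
from datetime import date, timedelta

def week_start(d):                         # DATETRUNC('week', [Date of Event])
    return d - timedelta(days=d.weekday())  # weekday(): Monday == 0

def weeks_ago(event_date, today):          # DATEDIFF('week', [WeekStart], [WeekOfNow])
    return (week_start(today) - week_start(event_date)).days // 7

today = date(2017, 4, 26)                  # the Wednesday this post was written
# A "rolling 4 full weeks" filter keeps rows where 1 <= weeks_ago <= 4,
# which excludes the current (partial) week.
```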



Just to further clarify: if you had used Tableau's relative date filter for the last 4 weeks, it would include the current week (which is a partial week) and only three full weeks:


Further Application:

Ever have a line chart based off of live data?  Have you had that 'dangling line' or 'half-bar' affecting the final view?  The above datediff can be applied with Today() and 'day' to remove the current day's values from a trend.

A useful KNIME connector: Model-to-SQL

I’d like to take this time to highlight a very useful KNIME module called PMML to SQL.


This lovely module allows you to transform simple data-mining algorithms into an actual working SQL script.  PMML stands for "Predictive Model Markup Language", and it is a format for interchanging data-mining models between systems.  KNIME offers us the ability to translate that directly into SQL.


For this example I am using the wine dataset and doing an easy classification of whether a wine is red or white based on the physicochemical variables.  After loading both files into a SQL table called zzz_test_data (I used KNIME to do that), I also needed to create an identity column, since a unique identity column is required by the PMML to SQL module.  Here is a simple method for doing that on my table:

alter table db.dbo.zzz_test_data add ident INT IDENTITY

After the above steps I then built this simple workflow to predict the color.


The logistic regression learner model above had these settings:


In addition, it may be helpful to create a view or rename columns to remove any spaces, as the PMML to SQL module will automatically place an underscore in the name.  The module requires that all the variables you want to utilize exist in the SQL environment, so there may be additional technical debt in stored procedures or ETL to perform certain transformations before you can apply the model.  In this example I built an interaction variable called 'FreeSulfPerc' out of the two variables [free sulfur dioxide] / [total sulfur dioxide] in SQL.  Combining variables within the SQL does have the added benefit of making the resulting PMML-to-SQL calculation smaller.  I also aliased the model's citric_acid input back to the original column name, as in [citric acid] as citric_acid.

The output of the PMML-to-SQL module thus looks like this:

SELECT p.*, (
CASE WHEN PRed>=PWhite THEN 'Red' WHEN PWhite>=PRed THEN 'White' END) AS Color 
FROM (SELECT ident, s_Red / ((s_Red) + (s_White)) AS PRed, s_White / ((s_Red) + (s_White)) AS PWhite 
FROM (SELECT ident, EXP((-440.57444453231847 * p0 + -4.407701802582567 * p1 + 56.558477040792845 * p2 + 414.9181775692244 * p3 + 5.205643006902104 * p4 + 8.115457264590452 * p5 + 8.539945396713552 * p6)) AS s_Red, EXP(0.0) AS s_White 
FROM (SELECT ident, 1.0 AS p0, POWER(citric_acid, 1) AS p1, POWER(chlorides, 1) AS p2, POWER(density, 1) AS p3, POWER(pH, 1) AS p4, POWER(sulphates, 1) AS p5, POWER(FreeSulfPerc, 1) AS p6 
FROM (SELECT ident, [citric acid] AS citric_acid, chlorides AS chlorides, density AS density, pH AS pH, sulphates AS sulphates, [free sulfur dioxide]/[total sulfur dioxide] AS FreeSulfPerc 
FROM zzz_test_data AS ms_input) AS local_trans) AS x) AS s) AS p
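What that generated SQL computes is a two-class softmax with 'White' as the reference class (s_White = exp(0) = 1), i.e. plain logistic regression.  Here is a Python sketch with the same coefficients copied from the SQL above; the input row is made up:

```python
import math

# Coefficients copied from the generated SQL; p is the basis vector p0..p6
COEFS = [-440.57444453231847, -4.407701802582567, 56.558477040792845,
         414.9181775692244, 5.205643006902104, 8.115457264590452,
         8.539945396713552]

def predict_color(p):
    s_red = math.exp(sum(c * x for c, x in zip(COEFS, p)))
    s_white = math.exp(0.0)                  # reference class score is always 1
    p_red = s_red / (s_red + s_white)        # PRed in the SQL
    return ("Red" if p_red >= 1 - p_red else "White"), p_red

# Hypothetical row: [1.0, citric_acid, chlorides, density, pH, sulphates, FreeSulfPerc]
color, p_red = predict_color([1.0, 0.3, 0.08, 0.995, 3.2, 0.5, 0.3])
```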

Now we can take this exact SQL and combine it with our original SQL table and run it.

select a.*,b.PRed,b.PWhite,b.Color as PColor
from [dbo].[zzz_test_data] a
left join (
--model output from PMML to SQL from KNIME
 SELECT p.*, (
 CASE WHEN PRed>=PWhite THEN 'Red' WHEN PWhite>=PRed THEN 'White' END) AS Color 
 FROM (SELECT ident, s_Red / ((s_Red) + (s_White)) AS PRed, s_White / ((s_Red) + (s_White)) AS PWhite 
 FROM (SELECT ident, EXP((-440.57444453231847 * p0 + -4.407701802582567 * p1 + 56.558477040792845 * p2 + 414.9181775692244 * p3 + 5.205643006902104 * p4 + 8.115457264590452 * p5 + 8.539945396713552 * p6)) AS s_Red, EXP(0.0) AS s_White 
 FROM (SELECT ident, 1.0 AS p0, POWER(citric_acid, 1) AS p1, POWER(chlorides, 1) AS p2, POWER(density, 1) AS p3, POWER(pH, 1) AS p4, POWER(sulphates, 1) AS p5, POWER(FreeSulfPerc, 1) AS p6 
 FROM (SELECT ident, [citric acid] AS citric_acid, chlorides AS chlorides, density AS density, pH AS pH, sulphates AS sulphates, [free sulfur dioxide]/[total sulfur dioxide] AS FreeSulfPerc 
 FROM zzz_test_data AS ms_input) AS local_trans) AS x) AS s) AS p
) b on a.ident = b.ident


Here is the final output, translated fully into SQL:


You can imagine using this SQL to create a view that feeds a Tableau dataset which updates daily or in real-time.  A senior analyst can create and embed a simple data-mining classification for consumption by various junior analysts munging the data, financial analysts doing a forecast, or operations predicting future usage.


Simple Tableau View Logger with Initial SQL

Since Tableau 10 it has been possible to do cross-database joins, and since a version before that it has been possible to use Initial SQL with Tableau parameters.  If we combine these two features, we can have a standard connector that logs each Tableau view in a database of our choice.  This is an especially powerful feature because it can also log views of a workbook opened in desktop mode, although the user must be logged into Tableau Server.  Using the Initial SQL I will show you, you will be able to see who looks at a report regardless of the data source.  In my example below I took data in an Excel sheet and joined it to a live connection.  As you can see in this 2-line log, I have tableau_app as a column.  The log shows the last time I updated the data (Tableau Desktop); then, when I published the workbook, it triggered another logged event, this time from Tableau Server.


How To:

(This method works for either a data blend or a data join, but I will only cover a data join at the moment.)

  1. In a Database you can write to,  Create a table with 5 columns like this:
    CREATE TABLE [dbo].[tableau_view_log](
     [event_DT] [datetime] NULL,
     [tableau_user] [varchar](50) NULL,
     [tableau_app] [varchar](50) NULL,
     [project] [varchar](50) NULL,
     [tabl_wkbk] [varchar](50) NULL
    ) ON [PRIMARY]
  2. Make a new data source in Tableau (or use an existing one), but within that dataset create a fake field, call it "A", and only put the letter "A" in it.  (This can also be done with a calculation in Tableau, but I believe that depends on the Tableau version and database version.)
  3. Connect to an additional dataset by pressing ‘Add’ next to Connections.  Add a new connection to the database you can write to.
  4. Drag 'New Custom SQL' onto the pane and make a fake query to the tune of Select distinct 'A' as 'A' from [database]..[testtable].  Make sure this only returns 1 line of data.
  5. Connect this fake query and join on the ‘A’ column.  Now we have a live SQL connection joined to a static Excel file.
  6. Open and edit the orange connection and go into the Initial SQL section at the bottom of the connection page.  Enter the code below.  The 4th column (project) should be manually changed by you, the developer, to correspond to where you are publishing the report.  The bracketed parameters ([TableauServerUser], [TableauApp], [WorkbookName]) are Tableau’s built-in Initial SQL parameters, which Tableau fills in at connection time.
    use [DatabaseICanWriteTo];
    insert into tableau_view_log
    values (
     GETDATE(),
     [TableauServerUser],
     [TableauApp],
     'MyProjectName',  -- change this to your project
     [WorkbookName]
    );

*Note 1:* The account you connected with must have rights to write to the database I’ve called [DatabaseICanWriteTo].

This is where the multi-database join really shines.  Most of the time you will not have rights to write to the same database you are reading from.  I have a sandbox database where I have write access, so I can use a data blend *or* a multi-database join from the dataset I’m querying to my write location, and then I can save the views.  Getting write access to a production database would be nearly impossible in most work environments.

*Note 2:* You must leave the second connection as a LIVE connection.  You cannot use an extract, or you will not have a very useful view logger.  The fake query that returns “A” should ideally be very fast, possibly hitting your own purpose-built table containing only 1 row.  Since this logs an event, it adds some slowness to the page view.
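As a sketch of that purpose-built one-row table, assuming the same SQL Server database used for the log table (the table name here is a placeholder, not from the original post):

```sql
-- Hypothetical one-row helper table for the cross-database join key.
-- Keeping it to exactly one row makes the "fake" join query as fast as possible.
CREATE TABLE [dbo].[tableau_join_helper](
    [A] [varchar](1) NOT NULL
);
INSERT INTO [dbo].[tableau_join_helper] VALUES ('A');

-- The custom SQL in Tableau can then simply be:
-- SELECT A FROM [DatabaseICanWriteTo].[dbo].[tableau_join_helper]
```

Because this table never grows, the join step costs almost nothing on top of the logging insert itself.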

*Note 3:* Depending on the server’s cache settings, it may not trigger a new refresh.  This is okay, though; you don’t need every single refresh, and it should at least capture when a person views the report a few times a day.

The Caveats:

#1  I’ve hesitated to publish this tip.  Although I really do think it is useful in specific situations, there is one huge drawback: it replaces the published thumbnail in Tableau Server with a generic view image.  If your users are used to navigating visually through their Tableau Server project, it would annoy them to death to see a wall of these:

With the current project structure of Tableau Server, visual navigation is often used.

#2 As mentioned earlier, a user must be logged into Tableau Server to capture all the information.  If a user doesn’t have Server access or doesn’t log in (which is sometimes the case for users who do not publish regularly), then the logger will look like this:
We manually configured this specific workbook to record ‘Local’ as the project name.


Good Office Spaces for Data Collaboration.

This is going to be a sharp departure from my usual technical how-to guides.  In the past two years I’ve experienced several waves of changes to my workspace that have been completely positive from the perspective of sharing data with people.  Tableau has been remarkable in allowing me to analyze data at the speed of thought.  Not only does it allow me to try things as I think of them, it lets me quickly translate what other people are thinking and show it to them immediately.  At times, though, the audience was limited to the people who could fit in my cube or who took the time to interact with my workbooks on the server.  Aside from the limited features of Story Points, it’s relatively hard to keep all viewers on the same guided path with an interactive workbook.  The default view in Tableau won’t even tell you who viewed the workbook.  In short, there are not many ways to let all viewers travel along the same path synchronously without a formal meeting time and location.

Change #1:  The first change I experienced was from a high-wall cube to a low-wall one.  You’ve probably already realized this yourself, but it’s kind of hard to fit more than 2 people in your cubicle without feeling very cramped.  But with low cube walls the audience doesn’t need to crowd into the doorway of a single cube to watch the analyst.  They can be around without needing to be on top of you.  It also helps with the off-hand “hey check this out neighbor” or “how do you do X in Y” type conversations.

new office cubicles - Mark III - Arnolds Office Furniture real front
Let me turn my monitor around and show you!

Change #2:  Then I moved positions to a work environment with stand-up desks.  This allowed even more people to crowd around without feeling cramped, since I could ditch my chair and people didn’t have to hunch over or lean to see the monitor.  I could stand, four people shoulder-to-shoulder could all see the screen at once, and I could walk everyone through the analysis while they pitched questions and we engaged in deep dives quickly.

Most Popular Stand Up Desks
Everyone gather around for this!

Change #3: Very recently we had an Epson short-throw interactive projector installed in our room.   I believe this is the same model that we have in our office (there are non-interactive versions that are about 1/2 price).

Product Image
Short Throw Projector = Walk Right up & Touch Your Data

This is installed in our normal walk-space near my desk.  Placing it in semi-public space has a twofold benefit: A) we do not have to spend square footage on a dedicated meeting space, and B) we do not fall victim to the curse of an always-booked conference room.  Since it’s short-throw, people can come and go without disturbing the image.  Our particular model enables users to interact with the image on the wall, so at least 6 people can stand around and all reach out and interact with a Tableau dashboard.  The interactivity is limited (it doesn’t do right-clicks or hover-overs), but a participant can easily click a filter and change the analysis.

Conclusion:  I believe these changes have opened my mind to having a collaborative data sense-making space where everyone is aboard the same train of thought when exploring the data.  This method is in some ways superior to publishing an interactive dashboard on Tableau Server.  First of all, people may not even click on your report or see your email.  If they do, they will invariably make different interaction choices and see different results, so in the end everyone gets a different experience.  Tableau itself recognized this problem and provided a method for guided analysis by adding Story Points in a previous version.

Important Note:  The situations I’ve described are strictly for collaborative ad-hoc exploratory data analysis.  This is NOT a collaborative design process where loads of people tell me what color that bar chart should be or how the dashboard should look.  I’ve already built the tools and we interact with them to collectively discover things.

Finished example:


E-commerce as a % of Adjusted Retail Sales

This will be a departure from most of my previous posts.  Usually I just show how to do something in a tool but without any context on how I am actually using it at work to influence decisions and drive change.  Stick with me as I show you the ‘how’ first and then add in the context in the ‘why’ section.

The How in Tableau:

First download the data from the Federal Reserve:

Then dump each text file into 1 single directory on your computer.  In Tableau go to “Data”, then “New Data Source” (CTRL+D).  Choose “Text File” and point to the directory where you downloaded the data.  All the files will appear on the left.  Drag in the ECOMSA file first; this will form the backbone.  Then drag in each of the others, and Tableau will automatically make the connections since FRED gives each ‘date’ field the same name.

After this, go into a Tableau sheet.  Next you will want to create more “friendly” names.  I did this easily by just creating a new field for each measure and pointing it to the FRED name.  I also multiplied each by 1.00 to force the ability to synchronize axes, as some measures were integers and others decimals.  Your users will always be appreciative if you translate your deeper knowledge into something they understand.


Build the visualization and Share:
Ecommerce As a Percentage of Retail

The Why:

E-commerce is a huge growth area, and one I have had the pleasure of working in.  I previously worked for Home Depot’s .COM department, and now I work on an internally accessed website for special ordering.  E-commerce is still a small percentage of the overall retail economy; the Federal Reserve states that it is under 10% of total retail sales in the United States.  There are several categories of retail sales, but I have decided to exclude certain sales where the consumer has to physically travel somewhere in order to fill something up.  The two major ones are Food Service sales and Gasoline / Convenience Store sales.  I know there are several fast food companies with an online presence, but in my experience the sale is still tendered at the store, so I would not count these as true online sales.  In addition, most of their online sales are estimated to be a fraction of their total sales.  I have also excluded automobile sales, which have several regulatory hurdles preventing true online sales.  This creates a new baseline (green) retail sales amount.  Then I take the e-commerce amount (orange) and calculate that as a percentage.


Now by subtracting out Gasoline sales, which are depressed by low oil prices, and automobile sales, which are buoyed up for the same reason, I can remove some of the noise from what I want to see.
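The adjustment above can be sketched in SQL.  The table and column names here are hypothetical stand-ins for the FRED series, not the actual series IDs:

```sql
-- Hypothetical monthly retail table with one column per FRED series.
-- Adjusted baseline = total retail minus food service, gasoline, and autos;
-- the e-commerce share is then taken against that new baseline.
SELECT
    sales_month,
    total_retail_sales
      - food_service_sales
      - gasoline_station_sales
      - auto_dealer_sales                      AS adjusted_retail_base,
    ecommerce_sales,
    100.0 * ecommerce_sales
      / (total_retail_sales
         - food_service_sales
         - gasoline_station_sales
         - auto_dealer_sales)                  AS ecommerce_pct_of_base
FROM retail_sales_by_month
ORDER BY sales_month;
```

The same ratio is what the Tableau calculated field computes; the SQL form just makes the baseline arithmetic explicit.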

I can then take that (e-Commerce) / (Retail Sales New Base) calculated field and forecast it using the Forecast features in Tableau.  This shows e-Commerce several percentage points higher than what the Federal Reserve shows, with its forecast extending into next year.