When there are too many values for an IN clause, Spotfire generates a temp table and then joins the temp table to the main query. I don't know the exact limit, but it appears to be somewhere between 5,000 and 7,000 values.
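The same temp-table-and-join strategy is easy to demonstrate outside Spotfire. Here is a minimal sketch using Python's sqlite3 module; the table and column names are made up for illustration, and Spotfire's actual generated SQL will differ:

```python
import sqlite3

# In-memory database standing in for the data source Spotfire queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, i * 10) for i in range(10000)])

# Far more values than you would want to inline into an IN (...) clause.
selected_ids = list(range(0, 10000, 2))

# Instead of: SELECT ... WHERE id IN (0, 2, 4, ...)
# load the values into a temp table and join against it.
conn.execute("CREATE TEMP TABLE id_filter (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO id_filter VALUES (?)",
                 [(i,) for i in selected_ids])

rows = conn.execute(
    "SELECT s.id, s.amount FROM sales s JOIN id_filter f ON s.id = f.id"
).fetchall()
print(len(rows))  # 5000 rows match the filter
```

The join approach also sidesteps the hard limits many databases place on the number of literals or bind parameters in a single statement.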
For me, part of the roadblock to using Hadoop technologies was getting data out of Hadoop for end users to consume. Sure, I could write a custom web app leveraging D3.js for the graphical representation. But in an environment where Spotfire and other reporting tools are already in use, I wanted to hook those tools up to Hadoop and visualize my data rather than build custom apps.
HBase is a low-latency read/write database built on top of Hadoop. From Scalding's perspective, it's just another location to read or write data, and the Spyglass project makes it very easy to sink data to HBase from a Scalding job.
Scalding-JDBC utilizes Cascading-JDBC, which comes pre-built with support for several relational databases. The bad news is that Oracle is not among them, because the Oracle JDBC driver is not available in any public Maven repository.
This is definitely easier than my first attempts at running Scalding on a cluster. Using Kiji and the steps from the book lets you focus on learning and writing Scalding code rather than on environment setup.
While a bike race may not seem like a place to learn about development, after I returned home and had a chance to digest the experience, I realized a few ways the race helped me become a better developer.
HBase is cool. Pig is cool. It should be easy to put cool and cool together, right? I tried, and it took a while before I had cool talking to cool. Thank goodness it finally worked - and it turns out to be simple to get a Pig script to load data into an HBase table.