Kettles, Spoons, Kitchens, and Jobs

Using Kettle – Spoon ETL tool to move data from an RDBMS to MarkLogic

Don’t get me wrong.  My ETL-fu is strong.  Give me some SQL, XQuery, Java, Perl, and Python and I am a dangerous man when it comes to rapidly taking data out of your beloved RDBMS silo and putting it into a NoSQL database.  However scaling that ETL process to a large IT program that handles dozens of databases with thousands of tables can be a schedule and maintenance cost nightmare.  In this blog post I’m exploring an approach that I haven’t really tackled in a few years … using a visual ETL tool to pull data from a relational database, transform it into XML, and then upload it into a NoSQL database. (in this case MarkLogic)

I’ve had several bad experiences with open source ETL so I’ve come to demand two things:

  1. I need to be able to run the ETL jobs that I create in the tool outside of the IDE without upgrading to an enterprise edition
  2. The software needs to pass the open source “sniff test.”  By that I mean that I need to be able to download it from the internet, install it, and get a HelloWorld! working within 30 minutes.

I was able to get a basic example working with Kettle Spoon 4.2.0 (related to Pentaho Data Integration), but I had to give up on two or three other tools, before feeling like I had enough to feel justified writing a blog post.

Setup

I have MarkLogic 5.0 and PostgreSQL 9.1 running on one laptop and Spoon running on a second Windows laptop (it wouldn’t open on my Mac, though most people’s Youtube tutorial videos seem to be on Macs).  My PostgreSQL server has a database named “dave” with a table named “call” to represent call log information:

PostgreSQL source data

ETL Design

I’m going to pull data from a single PostgreSQL table using a JDBC connection from my ETL tool, map it’s columns to elements of a schema-less XML file, and then POST the file to a waiting HTTP accessible service (POX more than REST) over on my MarkLogic.  Ideally the MarkLogic endpoint would insert the file into a database, but for demo purposes I am just logging the received XML:

xquery version "1.0-ml";
xdmp:log(xdmp:quote(xdmp:get-request-body("xml")))

Spoon Transform

After creating a pretty generic JDBC connection to my PostgreSQL database, the first step was to create a Spoon Tranform

transform

The first widget here is an Input > Table input running a SQL call of

SELECT * from call

against my database connection.  The second widget is a Transform > Add XML tool which takes the row data from the table and, with some clicking, auto maps it to a flat XML structure.  The phone number integer from the call table was outputting with commas until I switched it to have a formatting pattern of #.

Add XML Column widget

The last step is a Lookup > Rest Client.  This appears to be a relatively new feature added to support all the NoSQL databases being used out there that use REST as their primary connection mechanism.  I just have to target a HTTP exposed XQuery module with a POST call and make sure I pass the output variable the Add XML step modified.

A test run results in the following on my MarkLogic ErrorLog.txt file (the expected result of the xdmp:log call)

Info: 6099-pentaho: <?xml version="1.0" encoding="UTF-8"?>
Info: 6099-pentaho: <call xml:lang="en"><id>1</id><phone>3235551234</phone><comments>Called Home</comments><name>Dave's Family</name></call>
Info: 6099-pentaho: <?xml version="1.0" encoding="UTF-8"?>
Info: 6099-pentaho: <call xml:lang="en"><id>2</id><phone>3235551234</phone><comments>Called Home Again</comments><name>Dave's Family</name></call>
Info: 6099-pentaho: <?xml version="1.0" encoding="UTF-8"?>
 …

As a final step I export the Transform to the filesyetm as a .ktr file using the Export file menu item. This will help when trying to run the ETL process from the command line.

Configuring a job for the transform

Kettle has a concept of a Job, which allows control from a flow composed of Transforms.  The following is a simple Job whose only purpose is exposing the Transform we made above.  The transform step is configured run the .ktr file we made above.

I then export this job to a .kjb using the same Export feature

Running from the Command Line

And now the most important part!  Being able to run the ETL from outside of the client GUI.   This part is absolutely essential because it allows me to use the ETL as a scheduled job or in a continous integration cycle.

The following is the command I run from the Windows command line to run my job:

kitchen.bat /file:.\mytransform\job.kjb /level:Minimal

If I were to get my Mac version of Kettle working, this is what I would run:

./kitchen.sh -file=./mytransform/job.kjb -level=Minimal

Next Experiments

Now I have to figure out if the tool can be used to export hierarchical XML obtained from a join across tables in the relational database.  I’ll leave that for another time.

Geo Reverse Query Performance

The MarkLogic Express License, which can be used in production for free, is most exciting for me because it includes both the geospatial and alerting features of MarkLogic server.  Combined with the ability to do reverse queries, these features make MarkLogic really stand out from ANY other technology when it comes to building real-time geospatial applications.  I’ll review a couple of concepts before diving into a performance profiling example:

Geospatial Query: MarkLogic has a number of geospatial construction and conversion functions built in, but querying usually boils down to finding which  geospatial points (a latitude / longitude pair) are within a geospatial region.  In the examples below I’ll be using a circle region to query for points.

Reverse Query: MarkLogic searches execute on inverted search indexes, which make up the MarkLogic Universal Index and optional additional user specified indexes like ranges, lexicons, and geospatial.  Using these indexes you can get a sub-second response to the question of which documents in a mountain of XML match a single fairly complex query.  Reverse queries allow you to do the opposite.  If you have a mountain of queries (say, descriptions of what users are interested in, rules, or alert specifications)  you can get a sub-second response to the question of which of the millions of queries match a single document.  Basically each query is serialized as XML, by the magic of the super-composable cts:query spec, and queried as if it were a document.  The cts:reverse-query() function turns any XML document into a query that will match stored queries.

Alerting:  Alerting is a MarkLogic support framework for having “actions” execute when a “rule” is matched by an incoming document.  Actions are XQuery scripts to be run.  Rules are cts:queries that target incoming documents.  The alerting framework simplifies creating and managing serialized queries, database triggers, and other complex parts of building an alerting app.

An example geospatial alert could be something like a real estate website which sends an email to a user when a new property becomes available within X miles of where the user has indicated they are interested in purchasing a home.  Another example is geoclassification.

An example of a geo reverse query that isn’t involved in alerting would be a real-time 3D choropleth map analytic (aka 3D geospatial facet, image thanks to http://thematicmapping.org/).  By knowing the polygon border of a country, state, neighborhood etc you could use a serialized polygon query and reverse query combination to classify documents as they are inserted and facet on this information as users refine their queries.

Performance Run:

As a test I’m going to attempt to show the scalability advantages of MarkLogic.  One of the problems with internet-scale alerting apps on traditional platforms is that they tend to have large hardware footprints or large latency.  If you have N alerts, then every time a new document is inserted, you might have to run N queries.  This does not scale well when the number of alerts starts climbing to meet the needs of large numbers of users.  If your site has ten thousand users who each have 100 alerts, that’s one million queries to be run on every database insert.  Not going to happen!  Most sites have to resort to periodic batch runs. This works, but it’s no longer real-time.  What if that alert is mission critical?  What if the alert could save someone’s life?

My test harness for real time reverse queries in MarkLogic.

I will submit a series of points which represents the path of an object on the earth’s globe.  There will be N+1 circles distributed over the globe which represent people interested in the object checking for intersections.  N circles will be negative matches, the +1 circle is the one that matches.  I will measure the time it takes to find for circles that match the object’s path for increasing values of N.


Random Circles (N) Average Query Time (s)
100 0.00237
1000 0.00305
10000 0.01336
100000 0.03499
1000000 0.26044

0.26 seconds to test one million alerts is fast!  The reverse query can be computed in parallel, so a MarkLogic cluster can distribute processing and the combine the answers (real-time MapReduce?) to allow for alerting on orders of magnitude more alerts without sacrificing sub-second response.

Lastly, here is my code:

xquery version "1.0-ml";

(: a whole bunch of random circles to make the reverse-query harder :)
let $n := 10000

for $count in (1 to $n)
let $lat := xdmp:random(180) - 90
let $lon := xdmp:random(360) - 180
let $query-doc :=
  
  let $a := cts:point( $lat,$lon)
  let $b := cts:point( $lat + 0.0001, $lon)
  let $circle := cts:circle( cts:distance($a, $b), $a)
  return
  <query>
{cts:element-geospatial-query( xs:QName("point"), $circle , ("coordinate-system=wgs84"))}
<a>{$a}</a>
<b>{$b}</b>
</query>

return
xdmp:document-insert( fn:concat("/circle-",$count,".xml"), $query-doc );

(: the actual matching circle :)

let $query-doc :=
  
  let $a := cts:point( 10.0005,36)
  let $b := cts:point( 10.00051,36)
  let $circle := cts:circle( cts:distance($a, $b), $a)
  return
  <query>
{cts:element-geospatial-query( xs:QName("point"), $circle , ("coordinate-system=wgs84"))}
<a>{$a}</a>
<b>{$b}</b>
</query>

return
xdmp:document-insert( fn:concat("/matching-cicle.xml"), $query-doc )

view raw circles.xqy This Gist brought to you by GitHub.
xquery version "1.0-ml";

(: the line of points which will become a reverse query to match circles:)

let $a := cts:point( 10.0005, 36)
let $b := cts:point( 10.00052, 36)
let $distance := cts:distance($a, $b)
let $bearing := cts:bearing($a, $b)
let $steps := 100
let $step-distance := $distance div $steps
let $points :=
   for $i in (0 to $steps)
   return
   cts:destination($a, $bearing, $step-distance * $i)

let $doc :=
<document>{
  for $point in $points
  return
  <point>{$point}</point>
}</document>

return
(

cts:search( /, cts:reverse-query( $doc ) ) ,
xdmp:elapsed-time()
)

XQuery and HTML5

XQuery generated HTML5XQuery is amazing at generating server-side dynamic XHTML.  PHP, Java, .Net and the like are good too but don’t have the advantage of a seamless connection to a storage model.  However, they do hav a big advantage over XQuery when it comes to HTML5 because they can serialize non-XML compliant HTML text.

With XQuery and XHTML the following code works just fine:

xquery version "1.0";
xdmp:set-response-content-type("text/html; charset=UTF-8"),
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="description" content="Awesome" />
        <title>Title</title>
        <script type="text/javascript" src="js/app.js" />
...

The advantage of building your XHTML in XQuery is now we can stick this code in a “View” as part of a Model-View-Controller software pattern.  I can write the view once and reuse it as a code component throughout the project, or across multiple projects.  However HTML5, the latest and greatest web technology with tons of momentum in the mobile web arena, prefers serialization that looks something more like the following:

<!doctype html>
<html>
    <head>
        <meta content="description" content="Awesome">
        <title>Title</title>
        <script src="js/app.js">
...

This is a problem for XQuery.  The un-closed tags are invalid XML, so XQuery causes an evaluation error that looks something like:

XDMP-UNEXPECTED: (err:XPST0003) Unexpected token syntax error, unexpected $end, expecting EndTagOpen_

The problem is that HTML not only allows for un-closed tags like <script> and <meta>, but actually almost requires them.  I believe this traces back to the SGML origins of HTML.  The following HTML creation attempts in XQuery are not parsed correctly by browsers:

<script src="js/app.js"/>   or   <script src="js/app.js"></script>

Because the XQuery running in MarkLogic is running in the “default” XML serialization mode, both of the script tags above are collapsed to the self closed tag

<script src="js/app.js"/>

This is a problem for many browsers which will tend to include the rest of the HTML body as a child of the script tag, either causing the read of the javascript in the header no to be parsed or actually eating the rest of the page and never rendering the body DOM.  I have found two solutions: Either write your XQuery a non-empty script tags so that XQuery won’t collapse them:

<script src="js/app.js">&nbsp;</script>

or change the output serialization to HTML mode in the declare section at the top of your script:

declare option xdmp:output "method=html";

You’ll have to remember to do this for every module that generates HTML because other tags, like <textarea>, have the same pitfall.  You could also do this for a MarkLogic app server across the board by changing it’s settings in the admin console.  I prefer the declare and see it as just another reason to have a separate “View” section of your MVC app.

The next hurdle to tackle is getting inline javascript and JQuery plugins and templating libraries which like putting curly braces { } in HTML attribute values to work well.  I’ll leave that for another day.  Here is my current take at an XQuery HTML5 Boilerplate view library:

xquery version "1.0-ml";

module namespace v-h5bp = "http://framework/view/v-h5bp";

(:  Stick this at the top of any module that generates HTML so that
    empty tags don't get truncated to non-empty tags :)
declare option xdmp:output "method=html";

(:
    HTML5 Template
    @author Dave http://www.front2backdev.com
    XQuery adaptation Public Domain.
        html5 boilerplate: Public Domain
        jQuery: MIT/GPL license
        Modernizr: MIT/BSD license
        Respond.js: MIT/GPL license
        Normalize.css: Public Domain

    Basis of this code is the HTML5 BOILERPLATE project
    Checkout http://html5boilerplate.com/mobile
:)

(:
    HTML5 Mobile Boilerplate layout

    $title -- The html head title of the page
    $script -- extra HTML5 nodes for the html head
        carful with your self-closed tags like <script/>, these may not parse correctly
        in browsers.  Instead make sure XML serialization creates something like the following:
            <script src="url to source"></script>
        If you don't put something inside the tag, MarkLogic will serialize this to
            <script src="url to source"/>
        which is supported by xhtml but not html5 parsers
    $html  -- HTML5 nodes to put in the body
    $meta-description -- meta description text
    $meta-content -- meta conent text
    $goog-analytics-id -- your UA-* Id for google analytics
    $cache-version -- iterate this variable to ignore cached CSS and JS
:)
declare function v-h5bp:mbp-page-layout(
    $title as xs:string,
    $script as node()*,
    $html as node()*,
    $meta-description as xs:string?,
    $meta-content as xs:string?,
    $google-analytics-id as xs:string?,
    $cache-version as xs:double
) {

xdmp:set-response-content-type("text/html"),
'<!doctype html>',
<!-- Conditional comment for mobile ie7 http://blogs.msdn.com/b/iemobile/ -->,
<!--[if IEMobile 7 ]>    <html class="no-js iem7"> <![endif]-->,
<!--[if (gt IEMobile 7)|!(IEMobile)]><!-->, <html class="no-js"> <!--<![endif]-->
<head>
    <meta charset="utf-8"/>

    <title>{$title}</title>
    <meta name="description" content="{$meta-description}"/>
    <meta name="author" content="{$meta-content}"/>

    <meta name="HandheldFriendly" content="True"/>
    <meta name="MobileOptimized" content="320"/>
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>

    <link rel="apple-touch-icon-precomposed" sizes="114x114" href="img/h/apple-touch-icon.png"/>
    <link rel="apple-touch-icon-precomposed" sizes="72x72" href="img/m/apple-touch-icon.png"/>
    <link rel="apple-touch-icon-precomposed" href="img/l/apple-touch-icon-precomposed.png"/>
    <link rel="shortcut icon" href="img/l/apple-touch-icon.png"/>

    <meta name="apple-mobile-web-app-capable" content="yes"/>
    <meta name="apple-mobile-web-app-status-bar-style" content="black"/>
    <script>/* <![CDATA[ */(function(a,b,c){if(c in b&&b[c]){var d,e=a.location,f=/^(a|html)$/i;a.addEventListener("click",function(a){d=a.target;while(!f.test(d.nodeName))d=d.parentNode;"href"in d&&(d.href.indexOf("http")||~d.href.indexOf(e.host))&&(a.preventDefault(),e.href=d.href)},!1)}})(document,window.navigator,"standalone")/* ]]> */</script>
    <link rel="apple-touch-startup-image" href="img/l/splash.png"/>

    <meta http-equiv="cleartype" content="on"/>
    <link rel="stylesheet" href="/css/style.css?v={$cache-version}"/>

    <script src="js/libs/modernizr-custom.js">&nbsp;</script>
    <script>/* <![CDATA[ */Modernizr.mq('(min-width:0)') || document.write('<script src="js/libs/respond.min.js">\x3C/script>')/* ]]> */</script>

    {$script}  

</head>

<body>

    {$html}

    <script src="//ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js">&nbsp;</script>
    <script>/* <![CDATA[ */window.jQuery || document.write('<script src="js/libs/jquery-1.6.2.min.js"><\/script>')/* ]]> */</script>

    <script src="js/script.js?v={$cache-version}"></script>
    <script src="js/mylibs/helper.js"></script>
    <script>/* <![CDATA[ */MBP.scaleFix();/* ]]> */</script>

    <script>
      var _gaq=[["_setAccount","{$google-analytics-id}"],["_trackPageview"]];
      (function(d,t){{var g=d.createElement(t),s=d.getElementsByTagName(t)[0];g.async=1;
        g.src=("https:"==location.protocol?"//ssl":"//www")+".google-analytics.com/ga.js";
        s.parentNode.insertBefore(g,s)}}(document,"script"));
    </script>

</body>
</html>

};

 


Updates:

In the effort of leaving around the best code samples possible I will likely be updating my posts as I learn better / cleaner ways of working with XQuery.  Here’s to progressive enhancement!

12/3/2011 : Using XQuery HTML output serialization to avoid empty tag collapsing.

12/4/2011 : Forgot helper script, better inlining of google analytics code.  Note double {{ and }} which XQuery serializes to a single { and } .