Obstacle Course Tutorial 1: Loading public API data into ML

U.S. Air Force photo/Benjamin Faske

I’ve written a tutorial or two for this site, but I’m going to take a shot at a different way of putting out training materials.  Call it, “training by obstacle course.”  I believe there is an audience for training that appreciates learning by challenge rather than copy-paste.  Personally I learn by doing, so in order to internalize a technology I prefer to figure things out myself using the existing API documentation and a push in the right direction.  Let me know what you think!

Loading public data into ML challenge:

Objective:

  • Periodically populate a MarkLogic database with interesting full text content from NPR
  • Get used to looking up XQuery and MarkLogic functions in the API documentation

Prerequisites:

Go to the National Public Radio API site and register a user account.  You can now get an API key to use their story API services.  These will get you an XML representation of stories in specific topics as well as full text transcriptions.  Play around with their query builders.  I decided to work on news stories from Afghanistan:

  • http://api.npr.org/query?id=1149&apiKey=PUT_YOUR_KEY_HERE

Using your MarkLogic admin console (port 8001), create a new database called NPR.  You’ll have to create and attach a forest to this database before you’ll be able to use it.

Using your query console (http://HOSTNAME:8000/qconsole/), pull data from NPR’s API into your NPR database:

  • Don’t forget to set the Query Console content source to the NPR database
  • Check out the xdmp:http-get() function from the MarkLogic XQuery API docs.  (Alternate searchable version) This will pull data from NPR’s HTTP based API.
  • Note: xdmp:http-get returns two items.  The second item returned is the retrieved XML.
  • Insert the data into the NPR database using the xdmp:document-insert() function

If you are using the NPR “query” API like I did, the returned XML has <story> tags which have inside them <transcript> tags that include links to secondary API calls for retrieving transcripts of the audio content.  You have the option of using the story tags as they are, deciding to follow the links to the transcripts, or both!

  • Modify your code to pull the same query, but now loop through the stories and possibly the embedded transcripts ( you could use a FLWOR statement )
  • For each story optionally retrieve the referenced transcript from the web and insert it into MarkLogic.  The URI you insert the file into should probably involve the unique story Id from within the retrieved data.
  • Check to see if the document exists already using fn:doc-available().  If the document is already there, just skip the write action ( you could use an IF / ELSE statement )

Congrats!  You have now loaded data into MarkLogic as-is from a public API.  As a next step, try deploying this code as a module:

  • Make the above code an .xqy script and place it on your file system.
  • Create an HTTP server on an open port which gets its data from the NPR database and its modules from the file system, with the “root” option set to the absolute path of the folder containing the module.
  • Hitting this “deployed” script with a web browser at http://HOSTNAME:PORT/SCRIPTNAME.xqy will now execute the data loading module.

Alternatively you could now configure a MarkLogic scheduled task to load live data from the NPR data feed on a regular cycle.  The scheduled task controls are in the admin console under Configure > Groups > group_name > Scheduled Tasks, inside the “Create” tab.  Tip: you’ll probably need to run the scheduled task as your admin user for now.  In a production situation you might consider creating a separate user with only the permissions needed to write to your database.

Congratulations, you are now pulling semi-real-time data into MarkLogic.  (The only way to get closer to real time would be to have a content provider push data to you as it happens, rather than polling for it on a schedule.)  Any search app we build on this data will now be instantly more interesting, as it will have new data each time you visit it!

Tutorial: XQuery 3D KML Histograms

Yesterday, my blog post on Software Engineering got over 2000 hits because I posted it on Hacker News as a blogging and social news experiment. (and because I am a huge nerd)  That night, I found myself staring at the real-time geospatial view in Google Analytics and got inspired to type up a similar visualization in XQuery as a thanks to the community for taking a few moments to skim my site.

Here is a tutorial for rendering a 3D KML Histogram, also known as a Choropleth diagram,  in MarkLogic.  The example will be on static data, but if the “event” content in the MarkLogic data was updated, building a real-time updating visualization would be trivial.  The code needs some tuning, but I’ll put it out here to let others use for their own purposes.

Source Data:

I copied and pasted the table of city names from my Google Analytics dashboard to Excel.  I then exported the first two columns to CSV, imported this text file using xdmp:document-load, and split the file with fn:tokenize(fn:doc($uri), '\r') to obtain the lines and fn:tokenize($line, ',') to get the column values.
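
If you want to prototype the parsing outside MarkLogic first, here is a minimal Python sketch of the same two-level split (lines, then columns). It is not the XQuery I ran: the function name parse_city_hits and the sample rows are hypothetical, and it assumes the export has no header row and exactly two columns (city, hit count).

```python
import csv

def parse_city_hits(text):
    """Split a two-column CSV export into (city, hits) pairs."""
    rows = []
    # splitlines() accepts \r, \n, or \r\n, so CR-only exports work too
    for row in csv.reader(text.splitlines()):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        city, hits = row[0].strip(), row[1].strip()
        # drop thousands separators such as "1,204" before converting
        rows.append((city, int(hits.replace(",", ""))))
    return rows

sample = "Bethesda,3\rMelbourne,2\r"  # made-up data for illustration
print(parse_city_hits(sample))
```

Using a CSV reader rather than a bare split on commas keeps quoted values such as "1,204" intact, which a plain tokenize on ',' would break apart.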

This would be much easier if Google Analytics put the lat/lng information for its Geo view in the detail table so that I could disambiguate cities with the same name.  For example, I know for a fact there are Hacker-blog readers in both Melbourne, Florida and Melbourne, Australia.  In a real world scenario I would derive the lat/lng from the requestor’s IP myself, just as Google is doing.  I’ll accept the inaccuracy of taking the first city hit off the Google geocoding service for the purposes of this demo:

http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address=Melbourne

(please copy and paste to a new tab so you don’t run up my domain’s daily quota ;) )

To simulate geospatial events being tracked in MarkLogic, I’ll then insert one geocoded document into my database for each hit that I received and make sure to place this document in a collection called “event” to keep it separate from other docs in the database. How you model this isn’t too important, but here’s how I did it:

<?xml version="1.0" encoding="UTF-8"?>
<event>
  <name>Bethesda</name>
  <point>38.9846520, -77.0947092</point>
</event>

I place these files in a MarkLogic database that has a geospatial element index on the “point” localname.

Desired KML

So now onto generating the KML for a heatmap.  KML is an XML standard for Google Earth.  Each “bar” in the heatmap will actually be a colored extruded polygon which will look like a semi-transparent “skyscraper” and be represented in the KML like the following:

<Placemark>
  <description>
    <h3>4 documents in this region.</h3>
  </description>
  <styleUrl>#s0012</styleUrl>
  <Polygon>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <outerBoundaryIs>
      <LinearRing>
        <coordinates>-54,-36,209600 -54,-27,209600 -63,-27,209600 -63,-36,209600 -54,-36,209600</coordinates>
      </LinearRing>
    </outerBoundaryIs>
  </Polygon>
</Placemark>

The coordinates are triples (lon,lat,altitude) listed in counterclockwise order so that the surface normals face outward and the Google Earth lighting equations work correctly.  The style reference is a pointer to a defined color style listed earlier in the KML.
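
A quick way to sanity-check the winding order of a ring is the shoelace formula: with longitude as x and latitude as y, a positive signed area means the points run counterclockwise. The following is a small Python sketch of that check, run against the coordinates from the Placemark above; the helper names are hypothetical and nothing here is part of the MarkLogic code.

```python
def parse_kml_coordinates(text):
    """Turn 'lon,lat,alt lon,lat,alt ...' into a list of (lon, lat) floats."""
    return [tuple(float(v) for v in triple.split(",")[:2])
            for triple in text.split()]

def signed_area(ring):
    """Shoelace formula over (lon, lat) pairs of a closed ring.

    Positive means counterclockwise, negative means clockwise.
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0

ring = parse_kml_coordinates(
    "-54,-36,209600 -54,-27,209600 -63,-27,209600 -63,-36,209600 -54,-36,209600"
)
print(signed_area(ring) > 0)  # True when the ring is counterclockwise
```

This treats degrees as flat x/y values, which is fine for checking orientation of a small box but is not a real area on the globe.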

Stepping through the Code

My URL rewriter for MarkLogic is a one liner for this example as I want every request on the whole port to go to my map.xqy script.

xquery version "1.0-ml" ;
(: just send everything to the KML generating script :)
"/map.xqy"

At the top of the module’s body I set up variables.  The first two are globals I’ll be updating with xdmp:set (which should be used carefully as it prevents XQuery from running tasks in parallel).  The $lat1/$lat2, $lon1/$lon2, and $count variables control the part of the globe that will be heatmapped and the rough number of gridded divisions in each dimension.

(: variables that will track maximum values :)
let $maxfreq := 0
let $maxregion := ()

(: analytics bounds -- set to entire world :)
let $lat1 := -90
let $lat2 := 90
let $lon1 := -180
let $lon2 := 180
let $count := 80

I then derive a $countx and a $county which will be the number of grid divisions in each dimension, with an eye towards keeping the regions “square.”  Remember that longitude has twice the numerical range of latitude.

  (: attempt to make the buckets square :)
  let $distx := ($lon2 - $lon1) (: cts:distance( cts:point($lat1,$lon1), cts:point($lat1, $lon2) ) :)
  let $disty := ($lat2 - $lat1) (: cts:distance( cts:point($lat1,$lon1), cts:point($lat2, $lon1) ) :)
  let $mindist := fn:min(($distx,$disty))
  let $sidedist := $mindist div $count
  let $_ := xdmp:log(text{"sidedist",$sidedist},"error")
  let $countx :=
    if( fn:ceiling($distx div $sidedist) castable as xs:integer ) then
      xs:integer(fn:ceiling($distx div $sidedist))
    else
      $count * 2
  let $county :=
    if( fn:ceiling($disty div $sidedist) castable as xs:integer ) then
      xs:integer(fn:ceiling($disty div $sidedist))
    else
      $count
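
To make the arithmetic concrete, here is the same bucket calculation as a short Python sketch (not the XQuery above, and without its castable-as-integer fallback). The function name grid_divisions is hypothetical; the inputs are the whole-world bounds and $count from the variables section.

```python
import math

def grid_divisions(lat1, lat2, lon1, lon2, count):
    """Return (sidedist, countx, county) for roughly square grid cells."""
    distx = lon2 - lon1          # width of the area in degrees of longitude
    disty = lat2 - lat1          # height of the area in degrees of latitude
    sidedist = min(distx, disty) / count   # side of one cell, in degrees
    countx = math.ceil(distx / sidedist)   # cells across
    county = math.ceil(disty / sidedist)   # cells up and down
    return sidedist, countx, county

# Whole world, 80 divisions along the shorter (latitude) dimension
print(grid_divisions(-90, 90, -180, 180, 80))
```

With the whole world and a count of 80, each cell is 2.25 degrees on a side, giving 160 divisions of longitude and 80 of latitude. "Square" here means square in degrees; the cells are not square on the ground away from the equator.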

I then run the histogram analysis in MarkLogic.  This is the easy part because the Search API has a built-in function for doing just this.

  let $searchres :=
      search:search(
          "",
      <options xmlns="http://marklogic.com/appservices/search">
<additional-query>
{cts:collection-query("event")}
</additional-query>
<constraint name="mygeo">
<geo-elem>
<heatmap s="{$lat1}" w="{$lon1}" n="{$lat2}" e="{$lon2}" latdivs="{$county}" londivs="{$countx}"/>
<facet-option>gridded</facet-option>
<element ns="" name="point"/>
</geo-elem>
</constraint>
<return-results>false</return-results>
<return-facets>true</return-facets>
</options>
      )

I remove out-of-range boxes (the Search API returns regions stretching from the poles to your outer bounds, which I don’t want).

let $boxes :=
    for $box in $searchres//search:box
    let $s := xs:float($box/@s)
    let $w := xs:float($box/@w)
    let $n := xs:float($box/@n)
    let $e := xs:float($box/@e)
    return
    if( ($s ge $lat1) and
        ($n le $lat2) and
        ($w ge $lon1) and
        ($e le $lon2) ) then
            let $_ := xdmp:set( $maxfreq, fn:max( ($maxfreq, xs:integer( $box/@count )) ) )
            return
            $box
        else
          (: remove this box, because it is accounting for hits outside the search area :)
            ()
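
The filtering rule is easier to see without the xdmp:set bookkeeping. This is a Python sketch of the same logic with each box modeled as a plain dictionary; the function name filter_boxes and the sample boxes are hypothetical. It computes the maximum count in a second pass over the kept boxes instead of mutating a variable inside the loop, which is one possible way to avoid xdmp:set (I have not benchmarked that approach in XQuery).

```python
def filter_boxes(boxes, lat1, lat2, lon1, lon2):
    """Keep boxes that lie fully inside the bounds; return them with the max count."""
    kept = [
        b for b in boxes
        if b["s"] >= lat1 and b["n"] <= lat2
        and b["w"] >= lon1 and b["e"] <= lon2
    ]
    # default=0 covers the case where every box was out of range
    maxfreq = max((b["count"] for b in kept), default=0)
    return kept, maxfreq

# Made-up heatmap boxes for a search area of lat 0..45, lon 0..90
boxes = [
    {"s": -90, "w": 0, "n": 0, "e": 45, "count": 12},   # stretches to the pole: dropped
    {"s": 0, "w": 0, "n": 22.5, "e": 45, "count": 4},
    {"s": 22.5, "w": 45, "n": 45, "e": 90, "count": 7},
]
kept, maxfreq = filter_boxes(boxes, 0, 45, 0, 90)
print(len(kept), maxfreq)
```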

We then generate the KML markers.  I’ll keep a map of the styles so I can deduplicate them when serializing the KML.

(: Make sure to specify coordinates in CCW order so that KML lighting works correctly :)
declare function local:coord-from-box($box as cts:box, $alt as xs:double){
    <kml:coordinates>
{
        let $south := xs:string(cts:box-south($box))
        let $west := xs:string(cts:box-west($box))
        let $north := xs:string(cts:box-north($box))
        let $east := xs:string(cts:box-east($box))
        let $alt := xs:string($alt)
        return
        (
        
            fn:string-join((
                fn:string-join(($east,$south,$alt),","),
                fn:string-join(($east,$north,$alt),","),
                fn:string-join(($west,$north,$alt),","),
                fn:string-join(($west,$south,$alt),","),
                fn:string-join(($east,$south,$alt),",")
            )," ")
        )
    }
</kml:coordinates>
};


let $stylemap := map:map()
let $markers :=
    for $box at $x in $boxes
        let $s := xs:float($box/@s)
        let $w := xs:float($box/@w)
        let $n := xs:float($box/@n)
        let $e := xs:float($box/@e)
        let $freq := xs:integer($box/@count)
        return
      
          let $_ := if($freq eq $maxfreq) then xdmp:set($maxregion, $box) else ()
          let $alpha := if ($freq ge 1) then 0.5 else 0.1
          let $freq-prec := xs:double(fn:substring( xs:string($freq div $maxfreq), 1 , 5))
          let $color-prec := xs:double(fn:substring( xs:string((if($freq eq 0) then 1 else $freq) div $maxfreq), 1 , 5))
          let $style-name := fn:concat("s",fn:replace(xs:string($freq-prec),"\.",""))
          let $_ := if(map:get($stylemap,$style-name)) then () else map:put($stylemap, $style-name,
              <Style id="{$style-name}" xmlns="http://www.opengis.net/kml/2.2">
<PolyStyle>
<color>{local:html-color-from-percentage($color-prec, $alpha)}</color>
<colorMode>normal</colorMode>
</PolyStyle>
</Style>
              )
          let $alt := (200000 + (800000 * $freq-prec)) * (xs:float($sidedist) div xs:float(9.0))
          let $ctsbox := cts:box($s,$w,$n,$e)
          return
              <Placemark xmlns="http://www.opengis.net/kml/2.2">
<description>
<h3>{$freq} document{if($freq gt 1) then 's' else ()} in this region.</h3>
</description>
<styleUrl>#{$style-name}</styleUrl>
<Polygon xmlns="http://www.opengis.net/kml/2.2">
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<outerBoundaryIs>
<LinearRing>
{local:coord-from-box($ctsbox,$alt)}
</LinearRing>
</outerBoundaryIs>
</Polygon>
</Placemark>
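
The color and altitude arithmetic in that loop can be checked by hand. Below is a Python sketch of the two calculations, mirroring local:html-color-from-percentage and the $alt expression from the complete source further down; the Python function names are hypothetical. The sample inputs (a normalized frequency of 0.012 and a cell side of 9 degrees) are my reading of what the s0012 style and the 209600 altitude in the earlier Placemark example imply, so treat them as an assumption.

```python
import math

# White-to-red scale in BBGGRR order, copied from the complete source in this post
COLOR_SCALE = ["CCFFFF", "A0EDFF", "76D9FE", "4CB2FE", "3C8DFD",
               "2A4EFC", "1C1AE3", "2600BD", "2600BD"]

def hex_pair(fraction):
    """Map 0.0-1.0 to two hex digits, rounding half up."""
    value = int(math.floor(255 * fraction + 0.5))
    value = max(0, min(255, value))
    return format(value, "02x")

def kml_color(color_prec, alpha):
    """Build an AABBGGRR string: alpha pair followed by a color from the scale."""
    index = max(1, math.ceil(color_prec * len(COLOR_SCALE)))  # 1-based, like XQuery
    return hex_pair(alpha) + COLOR_SCALE[index - 1]

def bar_altitude(freq_prec, sidedist):
    """Bar height: 200 km base plus up to 800 km, scaled by cell size."""
    return (200000 + 800000 * freq_prec) * (sidedist / 9.0)

print(kml_color(0.012, 0.5))
print(bar_altitude(0.012, 9.0))
```

Scaling the altitude by the cell size keeps the bars at similar proportions when you zoom the analysis in to a smaller region with smaller cells.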

The last step is to return the KML:

  return (
      xdmp:set-response-content-type("application/vnd.google-earth.kml+xml"),
      '<?xml version="1.0" encoding="UTF-8"?>',
      
      <kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>KML Heatmap Global</name>
<open>1</open>
{$camera}
{
                <Style id="ml">
<IconStyle>
<color>FF3122D9</color>
</IconStyle>
</Style>,
            
                for $key in map:keys($stylemap)
                return
                map:get($stylemap,$key),
            
                $markers
            }
</Document>
</kml>
      )

More Screenshots created by varying the input variables:

 

The Complete Source:


xquery version "1.0-ml";

import module namespace search="http://marklogic.com/appservices/search"
                    at "/MarkLogic/appservices/search/search.xqy";

declare namespace kml = "http://www.opengis.net/kml/2.2";

(:White to red color scale in BBGGRR format :)
declare variable $COLOR_SCALE :=
(
"CCFFFF",
"A0EDFF",
"76D9FE",
"4CB2FE",
"3C8DFD",
"2A4EFC",
"1C1AE3",
"2600BD",
"2600BD"
);


(:~
Converts number to single digit hex
I'm sure there is a better way to do this in XQuery, but I am lazy
:)
declare function local:numToHex($n) as xs:string {
    if($n gt 15) then 'f'
    else if($n gt 9) then
      if($n eq 10) then 'a'
      else if($n eq 11) then 'b'
      else if($n eq 12) then 'c'
      else if($n eq 13) then 'd'
      else if($n eq 14) then 'e'
      else if($n eq 15) then 'f' else '0'

    else xs:string($n)
};

(: takes from 1.0 to 0.0 :)
declare function local:numToHexPair($num) {
    let $intalpha := fn:round(255 * $num)
    let $big := $intalpha idiv 16
    let $small := $intalpha mod 16
    let $bigchar := local:numToHex($big)
    let $smallchar := local:numToHex($small)
    return
    fn:concat($bigchar,$smallchar)
};

(: Make sure to specify coordinates in CCW order so that KML lighting works correctly :)
declare function local:coord-from-box($box as cts:box, $alt as xs:double){
    <kml:coordinates>
{
        let $south := xs:string(cts:box-south($box))
        let $west := xs:string(cts:box-west($box))
        let $north := xs:string(cts:box-north($box))
        let $east := xs:string(cts:box-east($box))
        let $alt := xs:string($alt)
        return
        (
        
            fn:string-join((
                fn:string-join(($east,$south,$alt),","),
                fn:string-join(($east,$north,$alt),","),
                fn:string-join(($west,$north,$alt),","),
                fn:string-join(($west,$south,$alt),","),
                fn:string-join(($east,$south,$alt),",")
            )," ")
        )
    }
</kml:coordinates>
};

(: Color is specified as four hex pairs: a transparency (alpha) pair
followed by a color triple, AABBGGRR. Example: 88ff0000 :)
declare function local:html-color-from-percentage($freq-prec, $alpha) {
    let $colornum := xs:integer( fn:ceiling($freq-prec * fn:count($COLOR_SCALE)) )
    let $colornum := if($colornum eq 0) then 1 else $colornum
    let $color := $COLOR_SCALE[$colornum]
    return
    fn:concat(
        local:numToHexPair($alpha),
        $color
    )
};



 

(: variables that will track maximum values :)
let $maxfreq := 0
let $maxregion := ()

(: analytics bounds -- set to entire world :)
let $lat1 := -90
let $lat2 := 90
let $lon1 := -180
let $lon2 := 180
let $count := 80


  
  (: attempt to make the buckets square :)
  let $distx := ($lon2 - $lon1) (: cts:distance( cts:point($lat1,$lon1), cts:point($lat1, $lon2) ) :)
  let $disty := ($lat2 - $lat1) (: cts:distance( cts:point($lat1,$lon1), cts:point($lat2, $lon1) ) :)
  let $mindist := fn:min(($distx,$disty))
  let $sidedist := $mindist div $count
  let $_ := xdmp:log(text{"sidedist",$sidedist},"error")
  let $countx :=
    if( fn:ceiling($distx div $sidedist) castable as xs:integer ) then
      xs:integer(fn:ceiling($distx div $sidedist))
    else
      $count * 2
  let $county :=
    if( fn:ceiling($disty div $sidedist) castable as xs:integer ) then
      xs:integer(fn:ceiling($disty div $sidedist))
    else
      $count
  

  let $searchres :=
      search:search(
          "",
      <options xmlns="http://marklogic.com/appservices/search">
<additional-query>
{cts:collection-query("event")}
</additional-query>
<constraint name="mygeo">
<geo-elem>
<heatmap s="{$lat1}" w="{$lon1}" n="{$lat2}" e="{$lon2}" latdivs="{$county}" londivs="{$countx}"/>
<facet-option>gridded</facet-option>
<element ns="" name="point"/>
</geo-elem>
</constraint>
<return-results>false</return-results>
<return-facets>true</return-facets>
</options>
      )


let $boxes :=
    for $box in $searchres//search:box
    let $s := xs:float($box/@s)
    let $w := xs:float($box/@w)
    let $n := xs:float($box/@n)
    let $e := xs:float($box/@e)
    return
    if( ($s ge $lat1) and
        ($n le $lat2) and
        ($w ge $lon1) and
        ($e le $lon2) ) then
            let $_ := xdmp:set( $maxfreq, fn:max( ($maxfreq, xs:integer( $box/@count )) ) )
            return
            $box
        else
          (: remove this box, because it is accounting for hits outside the search area :)
            ()

let $stylemap := map:map()
let $markers :=
    for $box at $x in $boxes
        let $s := xs:float($box/@s)
        let $w := xs:float($box/@w)
        let $n := xs:float($box/@n)
        let $e := xs:float($box/@e)
        let $freq := xs:integer($box/@count)
        return
      
          let $_ := if($freq eq $maxfreq) then xdmp:set($maxregion, $box) else ()
          let $alpha := if ($freq ge 1) then 0.5 else 0.1
          let $freq-prec := xs:double(fn:substring( xs:string($freq div $maxfreq), 1 , 5))
          let $color-prec := xs:double(fn:substring( xs:string((if($freq eq 0) then 1 else $freq) div $maxfreq), 1 , 5))
          let $style-name := fn:concat("s",fn:replace(xs:string($freq-prec),"\.",""))
          let $_ := if(map:get($stylemap,$style-name)) then () else map:put($stylemap, $style-name,
              <Style id="{$style-name}" xmlns="http://www.opengis.net/kml/2.2">
<PolyStyle>
<color>{local:html-color-from-percentage($color-prec, $alpha)}</color>
<colorMode>normal</colorMode>
</PolyStyle>
</Style>
              )
          let $alt := (200000 + (800000 * $freq-prec)) * (xs:float($sidedist) div xs:float(9.0))
          let $ctsbox := cts:box($s,$w,$n,$e)
          return
              <Placemark xmlns="http://www.opengis.net/kml/2.2">
<description>
<h3>{$freq} document{if($freq gt 1) then 's' else ()} in this region.</h3>
</description>
<styleUrl>#{$style-name}</styleUrl>
<Polygon xmlns="http://www.opengis.net/kml/2.2">
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<outerBoundaryIs>
<LinearRing>
{local:coord-from-box($ctsbox,$alt)}
</LinearRing>
</outerBoundaryIs>
</Polygon>
</Placemark>
    

    let $camera := if($boxes) then
            <LookAt id="camera1">
<longitude>{fn:avg((xs:float($maxregion/@w), xs:float($maxregion/@e)))}</longitude>
<latitude>{fn:avg((xs:float($maxregion/@n), xs:float($maxregion/@s)))}</latitude>
<altitude>0</altitude>
<altitudeMode>relativeToGround</altitudeMode>
<heading>-10</heading>
<tilt>45</tilt>
<roll>0</roll>
<range>{8000000 * (xs:float($sidedist) div xs:float(9.0)) }</range>
</LookAt>
        else ()
        
        
    return (
      xdmp:set-response-content-type("application/vnd.google-earth.kml+xml"),
      '<?xml version="1.0" encoding="UTF-8"?>',
      
      <kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>KML Heatmap Global</name>
<open>1</open>
{$camera}
{
                <Style id="ml">
<IconStyle>
<color>FF3122D9</color>
</IconStyle>
</Style>,
            
                for $key in map:keys($stylemap)
                return
                map:get($stylemap,$key),
            
                $markers
            }
</Document>
</kml>
      )
      

Tutorial: Mobile Shakespeare (Part 3 – Adding Search)

In the last part of this tutorial we skinned the Mobile Shakespeare app to be more memorable and distinctive.  Now it’s time to add some search functionality.

The complete code base for this sample is now up on GitHub for your reference:

github/derickson/shake/xquery2

I’ve cleaned up the XQuery for readability and will be linking only portions of the code here.

This is the app we are going to build

REST

First we’ll set up a few new REST targets in /lib/config.xqy

<get path="play/:id/act/:act/scene/:scene/speech/:speech"><to>play#scene</to></get>
        
<get path="search"><to>search#get</to></get>
<post path="search"><to>search#get</to></post>

Note, I’ve added the ability to jump to a specific SPEECH back in the /play resource.  You can check out this code yourself in the github copy of the code.

Search Resource

Next we’ll make a new resource for the search REST targets in /resource/search.xqy.  At the top of this file we are going to import the MarkLogic Search API, which is a high level XQuery library that sits on top of the core MarkLogic search functions in the cts:* library.  The Search API is a great place to start when building XQuery web apps because it does so much for you.  Like all high level tools, you may eventually outgrow parts of the Search API and decide to use the cts: functions directly (I do this all the time for intricate multi-tiered facets).  The cts functions are incredibly easy to compose and work much like boolean functions, but for now we’ll stick to the Search API.  So, that import statement:

import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

The most basic call of the Search API is

search:search( $searchTerm, $options)

The Search API $options parameter

The $searchTerm is a one-line string that fits a grammar specified in the second parameter, $options.  If you omit the options XML parameter, the Search API defaults do a good job of emulating a “Google-like” search syntax, but I want to make some modifications.  Rather than start from scratch learning how to construct the XML that makes up these options, we can get the Search API defaults to use as a starting point by calling the following function in the Query Console or cq. (Don’t forget the import of the Search API module.)

 search:get-default-options()

You can start editing from the Search API standard functionality.  Here is my finished $options object:

(: Search API options :)
declare variable $options :=
    <options xmlns="http://marklogic.com/appservices/search">
<concurrency-level>8</concurrency-level>
<debug>0</debug>
<page-length>10</page-length>
<search-option>score-logtfidf</search-option>
<quality-weight>1.0</quality-weight>
<return-constraints>false</return-constraints>
<!-- Turning off the things we don't use -->
<return-facets>false</return-facets>
<return-qtext>false</return-qtext>
<return-query>false</return-query>
<return-results>true</return-results>
<return-metrics>false</return-metrics>
<return-similar>false</return-similar>
<searchable-expression>//SPEECH</searchable-expression>
<sort-order direction="descending">
<score/>
</sort-order>
<term apply="term">
<!-- "" $term returns no results -->
<empty apply="no-results" />
<!-- Not sure why this isn't a default -->
<term-option>case-insensitive</term-option>
</term>
<grammar>
<quotation>"</quotation>
<implicit>
<cts:and-query strength="20" xmlns:cts="http://marklogic.com/cts"/>
</implicit>
<starter strength="30" apply="grouping" delimiter=")">(</starter>
<starter strength="40" apply="prefix" element="cts:not-query">-</starter>
<joiner strength="10" apply="infix" element="cts:or-query" tokenize="word">OR</joiner>
<joiner strength="20" apply="infix" element="cts:and-query" tokenize="word">AND</joiner>
<joiner strength="30" apply="infix" element="cts:near-query" tokenize="word">NEAR</joiner>
<joiner strength="30" apply="near2" consume="2" element="cts:near-query">NEAR/</joiner>
<joiner strength="50" apply="constraint">:</joiner>
<joiner strength="50" apply="constraint" compare="LT" tokenize="word">LT</joiner>
<joiner strength="50" apply="constraint" compare="LE" tokenize="word">LE</joiner>
<joiner strength="50" apply="constraint" compare="GT" tokenize="word">GT</joiner>
<joiner strength="50" apply="constraint" compare="GE" tokenize="word">GE</joiner>
<joiner strength="50" apply="constraint" compare="NE" tokenize="word">NE</joiner>
</grammar>
<!-- Custom rendering code for "Snippet" -->
<transform-results apply="snippet" ns="http://framework/lib/l-util" at="/lib/l-util.xqy" />
</options>;

Let’s go through it.  I turn off return of data I won’t be using for rendering.  For example my app has no facets:

<!-- Turning off the things we don't use -->
<return-facets>false</return-facets>

I want our search results to be the SPEECH tags inside the PLAY root elements.  In cts we would specify a “searchable expression” as the first param of cts:search. In the search API we add the following:

<searchable-expression>//SPEECH</searchable-expression>

Lastly I change some of the default text term options.  When a user doesn’t type anything, I’ll omit executing the search, rather than just pass back the first SPEECH in document order in the repository (which they can do from the Play button on the new home page anyway).  Also, when testing the app I found that searches for “My kingdom for a horse” returned zero results.  That’s because the default for the Search API is “case-sensitive”.  (Note: it’s a good idea to turn on the index for fast case sensitive search if this is what you want.)  But my users might type in “My kingdom for a horse”, so I’ll set a term option to case-insensitive:

<term apply="term">
    <!-- "" $term returns no results -->
    <empty apply="no-results" />
    <!-- Not sure why this isn't a default -->
    <term-option>case-insensitive</term-option>
</term>

The last step is to specify a custom snippeting library function.  The Search API assumes I am searching on text that is too large to present in a result, but I’d like my users to see the whole SPEECH in order to give the highlighted words context.  I’ll let you look at the highlight code yourself in github under /lib/l-util.xqy, but the portion of the $options that specifies which code to use is:

<!-- Custom rendering code for "Snippet" -->
<transform-results apply="snippet"
    ns="http://framework/lib/l-util"
    at="/lib/l-util.xqy" />

Wrapping up the Page

Next I’ll need a good search form.  I decided to have a Phrase Search toggle switch because mobile users often don’t have the quotation marks of the standard Search API grammar on their keyboards without going to a SHIFT alternate keyboard:

<!-- Search Form -->
                <form action="/search" method="get" data-transition="fade" class="ui-body ui-body-b ui-corner-all">
                    <fieldset >
<label for="search-basic">Search all lines:</label>
<input type="search" name="term" id="term" value="{$term}" data-theme="b" />
</fieldset>
                    <div data-role="fieldcontain">
                        <label for="slider2">Phrase search:</label>
                        <select name="phrase" id="phrase" data-role="slider" >
                            <option value="off">
Off
</option>
                            <option value="on">
                                {
                                    (: Dynamic inline attribute of the option element :)
                                    if($phrase eq "on") then
                                        attribute selected {"selected"}
                                    else
                                        ()
                                }
                                On
                            </option>
                        </select>
</div>
                    <button type="submit" data-theme="b" data-transition="fade">Submit</button>
                </form>

And I’ll actually need to call the search.  I pass the Search API results XML object to a transform function which you can go through on github:

<p>
{
                (: Search Results Area :)
                
                (:
Modify the typed search term.
Add Quotes if the $phrase flag is "on"
If the term is empty sequence, use ""
:)
                let $searchTerm :=
                    if(fn:exists($term)) then
                        if($phrase eq "on" and fn:not( fn:starts-with($term,'"') and fn:ends-with($term,'"'))) then
                            fn:concat('"',$term,'"')
                        else
                            $term
                    else
                        ""
                return
                
                    (:
Think functionally ...
XQuery passes the result of evaluating search:search
to local:transform-results
:)
                    
                    (: transform results into HTML5 :)
                    local:transform-results(
                        (: execute the search with the Search API :)
                        search:search($searchTerm, $options)//search:result
                    )
            }
</p>
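
The quoting rule in that let expression is small enough to restate on its own. Here it is as a Python sketch so the cases are easy to try; build_search_term is a hypothetical name, and None stands in for XQuery’s empty sequence.

```python
def build_search_term(term, phrase):
    """Wrap the term in double quotes when phrase search is on.

    term   -- what the user typed, or None if the parameter was absent
    phrase -- "on" or "off", from the slider in the form
    """
    if term is None:
        return ""
    already_quoted = term.startswith('"') and term.endswith('"')
    if phrase == "on" and not already_quoted:
        return '"' + term + '"'
    return term

print(build_search_term("my kingdom for a horse", "on"))
print(build_search_term('"my kingdom for a horse"', "on"))
print(build_search_term(None, "off"))
```

Checking for existing quotes first means a user who does type the quotation marks with the slider on does not end up with a doubly quoted phrase.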

So now in one XQuery script I have a basic search app that is already optimized for mobile browsers.  I’m happy with the performance of the site on my relatively new Android phone, but I imagine we’d want to pay close attention to some of the HTML and caching response headers being returned from MarkLogic, given that the content is completely static.  Here are some items for improvement and exploration I could think of:

  • Check performance after “Phonegapping” the HTML5 into a native iOS or Android App
  • Add the HTML5 meta tags for specifying an Apple icon when this site is bookmarked on iOS home screen.
  • Add paging to the search screen
  • Add the ability to search within a specific play (this could be done quickly with a Search API constraint and a drop down)
  • Allow users to “star” lines as their favorites (no login really necessary) and put links to the most popular lines on the Mobile Shakespeare home screen.

Phrase-Through and Phrase-Around

However, instead of spending time on that, let’s tune the MarkLogic indices slightly.  One of the strengths of MarkLogic over other full text search indexers is that MarkLogic preserves structure.  As a result, MarkLogic can do inferred metadata search and full text search out of the same indices without extra configuration or integration (it scales well too!). Here’s one of my favorite lines from Macbeth:

Quote From Macbeth Act 5 Scene 8

Some search engines flatten the text in their documents before generating search indexes.  This makes resolving relevance based on the surrounding XML or HTML tags very difficult.  Imagine searching for the phrase “Untimely ripp’d Accursed be” in the following examples:

1.)
<div>... was <b>untimely</b> ripp'd. Accursed be ...</div>

2.)
<div><p>... was untimely ripp'd.</p><p>Accursed be ...</p>

3.)
<div>... was untimely
  <annotation class="hidden tooltip">Awesome!</annotation>
ripp'd. Accursed b...</div>

4.)
<SPEECH>
  <LINE>Tell thee, Macduff was from his mother's womb</LINE>
  <LINE>Untimely ripp'd.</LINE>
</SPEECH>
<SPEECH>
  <LINE>Accursed be the tongue ...

The first sample might lead you to believe that flattening the text is a good idea.  By ignoring all structure, we get a relevant result because the words “untimely,” “ripp’d,” “accursed,” and “be” are adjacent.  The <b> tags represent style, not substance!  MarkLogic has something called a “Phrase-Through” index setting which tells the indexer to determine phrases through an element separation.  Obvious examples from XHTML, WordML, and other common namespaces come preconfigured so you don’t have to worry about them.

The second sample destroys the notion of text flattening.  Ripp’d and Accursed aren’t just in different sentences (as the period might inform some indexers), they are in different paragraphs and do not form a semantic “phrase”.  MarkLogic won’t Phrase-Through a <p> tag unless we tell it to so we get the correct behavior.

The third sample above is trickier.  Embedded into the text is markup that represents an inline annotation of the semi-structured data.  If I flatten the text, the word “Awesome” messes up our phrase, but a default parse of the XML structure also breaks up the semantics of the “untimely ripp’d” phrase.  MarkLogic can solve this with a “Phrase-Around” on the annotation tag.  A Phrase-Around setting tells the MarkLogic indexer to link the words before and after into a phrase but ignore the words inside.

The fourth sample is our data from the Shakespeare XML demo.  To get a good phrase search on “was from his mother’s womb untimely ripp’d” we need to set up a Phrase-Through on the LINE element in the Databases > shake > Phrase-Throughs setting on the MarkLogic admin menu.  Once we’ve done this a phrase enabled search for “was from his mother’s womb untimely ripp’d” results in:

correctly highlighted quote

Not bad.  By allowing Phrase-Through and Phrase-Around flexibility on the source XML schema we don’t have to transform the data to index it.  We get to preserve structure and have full text search at the same time!
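
If the difference between the two settings is still fuzzy, this toy model in Python may help. It is emphatically not how MarkLogic’s indexer is implemented: it just splits a document into word runs, treating every element as a phrase boundary unless it is listed as phrase-through (words flow through it) or phrase-around (its words are set aside and the surrounding words join up). All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def words(text):
    """Lowercase word tokens; apostrophes stay inside words (ripp'd)."""
    return re.findall(r"[a-z']+", (text or "").lower())

def _walk(elem, segments, through, around):
    segments[-1].extend(words(elem.text))
    for child in elem:
        if child.tag in through:
            # phrase-through: child words continue the current run
            _walk(child, segments, through, around)
        elif child.tag in around:
            # phrase-around: child words form their own runs, stored
            # before the current run so the current run stays last
            side = [[]]
            _walk(child, side, through, around)
            segments[-1:-1] = side
        else:
            # default: the element is a phrase boundary on both sides
            segments.append([])
            _walk(child, segments, through, around)
            segments.append([])
        segments[-1].extend(words(child.tail))

def phrase_segments(xml, through=(), around=()):
    """Return the list of word runs within which a phrase may match."""
    segments = [[]]
    _walk(ET.fromstring(xml), segments, set(through), set(around))
    return [s for s in segments if s]

def contains_phrase(segments, phrase):
    """True if the phrase's words appear consecutively inside one run."""
    target = words(phrase)
    return any(seg[i:i + len(target)] == target
               for seg in segments
               for i in range(len(seg) - len(target) + 1))

speech = ("<SPEECH><LINE>Tell thee, Macduff was from his mother's womb</LINE>"
          "<LINE>Untimely ripp'd.</LINE></SPEECH>")
query = "was from his mother's womb untimely ripp'd"
print(contains_phrase(phrase_segments(speech), query))
print(contains_phrase(phrase_segments(speech, through={"LINE"}), query))
```

Without any settings each LINE is its own run and the phrase is not found; listing LINE as phrase-through joins the two lines into one run and the phrase matches, which is the behavior the fourth sample relies on.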

– Dave