Machine Learning with Large Networks of People and Places

I recently gave a talk at foursquare HQ for the New York Machine Learning Meetup. Here’s the abstract, with the slides below. Hope you enjoy!

Foursquare is now aware of over 1.5 billion check-ins from 15 million people at 30 million different places all over the world. Each check-in can be thought of as an edge in a vast network connecting people to each other and to the places that they care about most. Graph-based machine learning algorithms are critical not only for making sense of these networks that emerge out of patterns of human mobility but also for creating useful data-driven products that help people better navigate the real world. In this talk, we will examine two networks that we have observed at foursquare, the Social Graph and the Place Graph, and then discuss various machine learning and big data techniques for better understanding these networks as well as using them to build a novel recommendation engine we call Explore.

Machine Learning with Large Networks of People and Places

- @metablake

A Hackday Project: What neighborhood is the ‘East Village’ of San Francisco?

Have you ever wondered what’s the equivalent of your neighborhood in another city? How you’d find the Times Square of Tokyo? The Beverly Hills of Dallas? Or the East Village of San Francisco? For a hackday project this January, we mapped our 1,500,000,000 check-ins to 140,000 neighborhoods all over the world to better understand and compare the different places we live, work, and play. Here is a brief account of our hack.

First, to collect data about neighborhoods, we built some Hive queries to access our large collection of check-ins (stored in S3) and count the number of check-ins per category for every neighborhood in the world. For example, the East Village of New York has 230k check-ins at bars, 57k check-ins at pizza places, 18k check-ins at yoga studios, and 34k check-ins at karaoke places (hipsters like to sing!).

We then used MATLAB to represent each neighborhood by a 400-dimensional vector that specifies the normalized probability distribution of checking in to a place in each category, relative to the baseline distribution of the city. This approach allows us to compare neighborhoods with each other using a measure such as cosine similarity or KL divergence.
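
As a rough illustration of this representation, here is a minimal Python sketch. It assumes that "relative to the baseline" means dividing each category's share in the neighborhood by its share in the city and then renormalizing; the hack itself used MATLAB, and its exact normalization is not spelled out here. The function names, the city-wide counts, and the comparison neighborhood are made up for the example; only the East Village counts come from the text above.

```python
import math

def shares(counts, categories):
    """Turn raw check-in counts into a probability distribution over categories."""
    total = sum(counts.get(c, 0) for c in categories)
    if total == 0:
        return [0.0 for _ in categories]
    return [counts.get(c, 0) / total for c in categories]

def relative_profile(hood_counts, city_counts, categories):
    """Neighborhood shares divided by city shares, renormalized to sum to 1 (assumed scheme)."""
    hood = shares(hood_counts, categories)
    city = shares(city_counts, categories)
    ratios = [h / c if c > 0 else 0.0 for h, c in zip(hood, city)]
    total = sum(ratios)
    return [r / total for r in ratios] if total else ratios

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q); eps avoids division by zero when q has an empty category."""
    return sum(a * math.log(a / (b + eps)) for a, b in zip(p, q) if a > 0)

categories = ["bar", "pizza place", "yoga studio", "karaoke"]
east_village = {"bar": 230000, "pizza place": 57000, "yoga studio": 18000, "karaoke": 34000}
# Hypothetical city-wide and comparison counts, for illustration only.
new_york = {"bar": 2000000, "pizza place": 900000, "yoga studio": 300000, "karaoke": 100000}
other_hood = {"bar": 40000, "pizza place": 30000, "yoga studio": 20000, "karaoke": 1000}

ev = relative_profile(east_village, new_york, categories)
other = relative_profile(other_hood, new_york, categories)
print(cosine_similarity(ev, other), kl_divergence(ev, other))
```

Cosine similarity is symmetric and bounded, which makes it convenient for a similarity matrix; KL divergence is asymmetric, so the order of the arguments matters.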

Here is a visualization of the similarity matrix for NYC neighborhoods. Blue entries indicate that two neighborhoods are very similar, and red entries indicate the neighborhoods that differ most. The ordering is determined by a k-means clustering of the data, so similar neighborhoods are placed close to each other. Looking along the diagonal of this matrix, we see groups of neighborhoods that are very similar to each other, such as the East Village, the Lower East Side, and Alphabet City.

It turns out that a good proxy for describing a neighborhood is the proportion of activities that go on inside it. For example, imagine two neighborhoods that both have lots of check-ins at apartments, colleges, and food trucks (think college towns). Those two neighborhoods are more similar to each other than either is to a neighborhood that has tons of check-ins at offices and retail stores.

Here are some top categories based on neighborhoods:
Soho, New York: clothing stores, offices, electronics stores, coffee shops, French restaurants
Mission, San Francisco: Mexican restaurants, bars, coffee shops, burrito places
Kendall Square, Boston: offices, food trucks, tech startups, college academic buildings, sandwich places
Hollywood, LA: nightclubs, multiplexes, burger joints, hotels, bars

At this point in the hack day, we shared this data with the rest of the company so people could explore their own neighborhoods. foursquare HQ had just moved from the East Village to Soho, and the whole office was eager to see a head-to-head comparison. So we put together a quick website using Ruby, Sinatra, Twitter Bootstrap, and the d3.js library for visualization. This allowed us to better visualize pairwise comparisons of neighborhoods and to easily click through the whole dataset.

Here is a visualization of the differences between the East Village and Soho:

We see that Soho has a lot of activity at offices and clothing stores, whereas the East Village has a lot of activity at bars and pizza places.

We can now also compare neighborhoods across different cities algorithmically:

Most similar to NY’s East Village in San Francisco:
Mission Dolores
Cow Hollow
Telegraph Hill

Most similar to SF’s Chinatown in NY:
Chinatown — obviously :)
Downtown Flushing
Long Island City

Most similar to Seattle’s Capitol Hill in SF:
Mission Dolores
Castro
Hayes Valley

Most similar to NY’s Coney Island in Orlando:
Walt Disney World Resort
Florida Center
Sea World Theme Park
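
Rankings like the ones above can be produced by scoring every neighborhood in the target city against the source neighborhood and keeping the top few. The following is a minimal sketch that assumes cosine similarity as the measure; the function names and the toy three-category vectors are hypothetical and are not taken from the real data.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def most_similar(target, candidates, k=3):
    """Return the names of the k candidates most similar to target, best first."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(target, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy category profiles; the names and numbers are placeholders.
source_hood = [1.0, 0.0, 0.0]
other_city = {
    "hood_a": [0.9, 0.1, 0.0],
    "hood_b": [0.0, 1.0, 0.0],
    "hood_c": [0.5, 0.5, 0.0],
}
print(most_similar(source_hood, other_city, k=2))
```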

Inherent in Foursquare’s 1,500,000,000 check-ins is a staggering amount of information about the characteristics of cities. It is now possible to quantify and measure the ways people interact with neighborhoods at a higher resolution than ever before. This whole hackday project was achievable in just a day and a half by two engineers because of the amazing data, infrastructure, and tools provided at Foursquare. There are many possible directions this project can go; for example, we’re looking forward to incorporating user demographics and time-based information into the model. If you have some good ideas for what to try next, please leave them in the comments, or better yet, join us and try them yourself!

- @metablake and @rathboma

Two days of tinkering: hacking together Squaredar!

Every couple of months, team foursquare takes a break from relentlessly shipping new features and spends a couple weeks in an engineering “fix-it.” During this time, all new development stops, and instead we try to pay down our technical debt by refactoring and polishing our code, fixing pet bugs that have been bothering us, improving our test coverage, and adding logging and metrics everywhere (we’re a little addicted to data).

At the end of the fix-it, when all the points have been tallied and the winners announced (yes, this is foursquare, so we make a game of it), we hold an internal Hack Day where engineers, designers and PMs from across the company get together and hack on whatever they want. It’s a time to prototype out a feature that you’ve always wanted, to build something that no one has thought of, to mess around in a part of the code that you normally don’t get to work with, or to give some application to a new library that you’ve been meaning to try out.

Our most recent Hack Day also happened to coincide with our last day at our old office at 36 Cooper Square in Manhattan’s East Village. Hack presentations happened the next Monday at our shiny new office in SoHo. It was a nice way to say goodbye to the old office and mark the transition to our new digs.

In the next few posts we’re going to highlight some interesting projects that came out of Hack Day. Our first project comes courtesy of Akshay, our platform evangelist and all-around API expert. His hack is somewhat unique in that it was built purely on top of our public API. As he explains:

Some of my favorite foursquare moments are discovering that a far-flung friend is visiting or that a “local” friend from across town is literally next door. Right now these discoveries are totally serendipitous, so I wanted to build a system that would alert me when friends are much closer than normal. I decided to call it Squaredar.

Unlike most foursquare hack day participants, I’m not fluent in Scala, so Squaredar doesn’t use any internal-only data or code. Instead, Squaredar just uses our public API. This is possible because our API is incredibly powerful; it’s the same API the official foursquare mobile apps use.

Squaredar was built over the hack day weekend on Google AppEngine in Python, using foursquare’s Push API and Mike Lewis’s great foursquare API library.

After a person authorizes Squaredar, the app receives a real-time notification every time the user checks in. Squaredar automatically acknowledges this message and pushes the check-in object onto an AppEngine TaskQueue. This is to ensure the request from foursquare’s servers doesn’t time out. A separate handler picks up the task and calls /checkins/recent with the user’s location. The call returns a list of recent check-ins by friends, with an additional “distance” parameter, indicating how far each friend is from the location specified in the request.

I keep a FriendDistance object for each (user, friend) pair in my database. After receiving the current list of distances, I fetch all pairs for the user and update a field tracking the friend’s average distance. I also update two fields for remembering their last seen distance and time. Once I’ve done this, I kick off a new task to handle notifications.
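
The bookkeeping described above can be sketched as follows. This is not the Squaredar source: the field names, the use of a dataclass, and the incremental-mean update are assumptions made for illustration, and the AppEngine datastore and task queue are left out.

```python
from dataclasses import dataclass

@dataclass
class FriendDistance:
    """One record per (user, friend) pair; field names are hypothetical."""
    user_id: str
    friend_id: str
    average_distance: float = 0.0    # running mean of every observed distance
    sightings: int = 0               # how many observations are in the mean
    last_seen_distance: float = 0.0
    last_seen_time: float = 0.0      # Unix timestamp, seconds

def record_sighting(fd, distance, seen_at):
    """Fold one new distance into the running mean and remember it as the latest."""
    fd.sightings += 1
    # Incremental mean: avoids storing the full history of distances.
    fd.average_distance += (distance - fd.average_distance) / fd.sightings
    fd.last_seen_distance = distance
    fd.last_seen_time = seen_at
    return fd

fd = FriendDistance(user_id="me", friend_id="friend")
record_sighting(fd, distance=100.0, seen_at=1000.0)
record_sighting(fd, distance=300.0, seen_at=2000.0)
print(fd.average_distance, fd.last_seen_distance)
```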

The notification task re-fetches the FriendDistance objects and checks the following:

  • Friend’s last seen distance is at least 100 times smaller than the average distance
  • Last seen distance is > 0 (no point notifying you every time you check in with a friend)
  • Friend’s last seen time is at least 6 hours ago
  • Last time Squaredar sent an alert about this friend was over a week ago.

If any of the FriendDistance objects meet all of these criteria, Squaredar sends an e-mail to the user listing the friends who are closer than normal.
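
The four checks can be expressed as one predicate. The sketch below follows the list as written, including the six-hour condition exactly as stated; the function name, argument names, and the choice of Unix timestamps are assumptions, and sending the e-mail is left out.

```python
HOUR = 60 * 60
WEEK = 7 * 24 * HOUR

def should_alert(average_distance, last_seen_distance, last_seen_time,
                 last_alert_time, now):
    """Return True when a friend meets all four criteria listed above.

    Distances share one unit; times are Unix timestamps in seconds.
    last_alert_time is None if no alert has ever been sent for this friend.
    """
    # At least 100 times smaller than the average distance.
    much_closer = last_seen_distance * 100 <= average_distance
    # A distance of 0 means the user and the friend checked in together.
    not_same_place = last_seen_distance > 0
    # Coded as the list states it: last seen "at least 6 hours ago".
    seen_check = now - last_seen_time >= 6 * HOUR
    # Do not alert about the same friend more than once a week.
    alert_ok = last_alert_time is None or now - last_alert_time > WEEK
    return much_closer and not_same_place and seen_check and alert_ok

now = 10 * WEEK
print(should_alert(5000000, 1000, now - 7 * HOUR, None, now))
```

Keeping the thresholds as named constants makes them easy to tune once real alerts start arriving.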

Squaredar works pretty well, but it wouldn’t be a hack if there weren’t a long list of improvements I’d like to make. For example, I’d like to switch from an average distance to a moving average distance over the last X sightings, so a trip to China doesn’t skew the average distance forever. I’d also like to remember if the user has checked in at the same place as the friend recently, so you don’t get notified about co-workers who stepped out for lunch or someone you’ve already caught up with. But I’m happy with what I built over the weekend, and I can’t wait for our next Hack Day!
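
The moving-average improvement mentioned above could look something like this. It is only a sketch of the idea, using a fixed-size window of recent sightings; the class name and the window size are hypothetical.

```python
from collections import deque

class MovingAverageDistance:
    """Average over only the most recent `window` sightings."""

    def __init__(self, window):
        # A deque with maxlen silently drops the oldest value when full.
        self._recent = deque(maxlen=window)

    def add(self, distance):
        self._recent.append(distance)
        return self.value

    @property
    def value(self):
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

tracker = MovingAverageDistance(window=3)
for d in [10.0, 10.0, 10.0]:
    tracker.add(d)
after_trip = tracker.add(10000.0)      # one far-away sighting
for d in [10.0, 10.0, 10.0]:
    tracker.add(d)                     # the outlier ages out of the window
print(after_trip, tracker.value)
```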

P.S. Love hacking? We’re hiring!

-Akshay Patil (@ak)

HeapAudit – JVM Memory Profiler for the Real World

HeapAudit is not a monitoring tool, but rather an engineering tool that collects actionable data: information sufficient to guide code changes directly. It was created for the real world and can be applied to live production servers.

HeapAudit is a foursquare open source project designed for understanding JVM heap allocations. It is implemented as a Java agent built on top of ASM.

Understanding JVM Memory Allocations

Performance and scalability issues are generally attributed to bottlenecks in code execution (CPU), memory allocations (RAM) and I/O (disk, network, etc.). Ignoring system-level performance problems (e.g. memory fragmentation, NUMA), memory issues are typically caused by the rate and size of heap allocations. The concept is easily understood but hard to track, especially in complex systems. Unlike CPU and I/O, memory allocations often cannot be understood by taking a short-windowed snapshot. Objects currently on the heap or about to be released may have been allocated some time ago, requiring more sophisticated analysis to trace back and pinpoint root causes.

Garbage Collection & Memory Profilers

Excessive garbage collection (GC) is frequently one of the main sources of memory-related performance bottlenecks in the Java Virtual Machine (JVM). The JVM runtime exposes hooks and counters to examine the heap objects as well as GC rates and sizes. Those pieces of information generally provide an overview of the symptom, but not the root of the performance issue. In other words, they describe the effect, not the cause. In small code bases, this may be sufficient to identify the problem. In large and complex systems, like those we run here at foursquare, this type of information is not actionable.

For instance, if the GC information tells you the JVM is garbage collecting hundreds of thousands of String objects every second, or that the JVM heap summary shows several millions of active String objects, it is not apparent where those String objects were allocated from.

There exist several sophisticated JVM profilers that address this by showing where each object is allocated from, thus pointing out the cause as opposed to only the effect. This is typically achieved by instrumenting code or sampling JVM processes and associating heap allocations with callstacks. This gives engineers abundant data to examine JVM-wide allocations and analyze hotspots. Unfortunately, the downside of acquiring such complete information is the overhead. A sub-millisecond operation may take several minutes while also generating gigantic log files. The log becomes even harder to comprehend if concurrent logic were to execute during the collection. In other words, the data is incredibly resource-consuming to process and doesn’t always help identify the source of the problem.

As a result, these JVM profilers are restricted to pre-production internal engineering use, where the environment is highly controlled. Hence, these tools are unsuitable for scenarios like production servers, where the issue is triggered by some unanticipated set of user-driven activities.

JVM hooks and counters:

  • Pros: natively supported by runtime, no extra overhead
  • Cons: relatively insufficient and non-actionable information

JVM memory profiling tools:

  • Pros: provides complete understanding of memory allocations mapped to callstacks
  • Cons: execution is extremely slow and only suitable for internal engineering purposes

At foursquare, we have tried a handful of memory profilers in conjunction with JVM GC information to analyze our memory usage and have mostly come to the following conclusions:

  • GC information tells us we allocate too much memory too frequently;
  • Memory profiler data is so voluminous that we do not have the people to analyze all the logs and understand most of the allocation patterns;
  • Our primary programming language, Scala, generates a massive number of anonymous functions, making it extremely hard to associate allocations with source code;
  • It is nearly impossible to build a system to track memory regressions due to code changes by processing the large profiler data or coarse GC data; and,
  • Of the small amount of data we manually analyzed, it is hard to distinguish what portion ties back to particular high-level logic (e.g. a specific endpoint request).

HeapAudit Java Agent

HeapAudit was created at foursquare precisely to address the fact that we needed a tool which gives us enough information to understand our allocation patterns, but at the same time can be applied to our production machines to understand the heap activities caused by our users.

At foursquare, we wrap the HeapAudit recorder around all user driven requests, attributing all heap allocations during the entire request to this particular recorder.

handleRequest() {

    HeapQuantile recorder = new HeapQuantile();

    // Register to record on local thread only
    HeapRecorder.register(recorder, false);

    try {
        // process request...
    } finally {
        // Make sure to unregister recording from local thread;
        // otherwise reference to the recorder will leak
        HeapRecorder.unregister(recorder, false);
    }

    // Tally heap allocations for the local thread with your
    // customized log output function
    log(recorder.tally(false, false));

}

We enable the HeapAudit Java agent on one machine in our server pool at all times and monitor for build-to-build allocation changes or anomalies. We’ve also enabled this for our internal tests and staging servers to immediately notify engineers if we have an unexpected surge in allocations due to code changes.

The HeapQuantile recorder used in the above example stores allocation statistics broken into quantiles. This is lightweight and performs reasonably well under high volume of heap allocations. HeapAudit does not log the allocations in its own proprietary format. It is up to the consumer of the library to store the information and optionally further dissect the data.

The recorders can be registered globally to capture all allocations across all threads or manually registered to related threads (like when passing execution to a child thread for background processing). This is precisely what we do at foursquare to correlate all heap allocations pertaining to specific endpoints.

Performance Overhead

The performance overhead is highly dependent on the code the recorders audit over. In particular, the overhead is directly tied to the concentration of heap allocation bytecodes within a block of source code. Anecdotally, when HeapAudit is applied to the foursquare production servers, we observe 50~100% latency overhead. This is much lower than the other JVM memory profilers we’ve examined and within reasonable range for servicing actual user requests on production servers.

License and Availability

HeapAudit is open sourced under the Apache License and can be used against any JVM processes. See https://github.com/foursquare/heapaudit for more information.

Have feedback or suggestions? Let us know! And if this is interesting to you, we’re hiring!

- Norbert Hu (@norberthu)

Show and Tell: MongoDB at foursquare

On Friday 12/9, @cooperb gave a talk at the MongoSV 2011 conference covering our experiences deploying MongoDB on Amazon Web Services, including some of the operational tricks we use to keep our database servers highly performant.

Watch the video:

The slides are here.

If you like this sort of work and feel up for a challenge, we’re hiring! Our Site Reliability Engineering team (@rjoseph, @inbredrevenge, @waxcorp, and @cooperb) is where the buck stops with respect to foursquare latency and availability.

Websites are clients, too!

Historically, our API and our website have shared code and lived in the same binary (Scala, Lift), but in many other ways they were developed independently, with the API focusing primarily on the needs of our client teams. The site redesign and recently launched website features like the homepage map, lists, and notifications, have brought them closer together. With these features we’ve begun consuming our own public APIs, via JavaScript, directly from the website.

This strategy offers two important benefits. First, using the API directly from the client, we ensure that there’s only one code path for any given action. Second, it reinforces our commitment to keeping the API a first class citizen that is totally up-to-date. The deep link between the API and the website ensures they move forward in unison.

Crossing domains

To get there, we had to overcome the challenges associated with cross-domain (technically, cross sub-domain) communication between foursquare.com (web) and api.foursquare.com (API). Although we could theoretically make our API available on the same domain, doing so would undermine the security and production isolation benefits of our current setup.

Our API supports CORS, but not every browser we want to support does. To work around this we used a common technique in which an iframe hosted on api.foursquare.com is embedded on the foursquare.com web pages. The iframe executes a simple JavaScript statement setting document.domain to foursquare.com (notably the same as the web domain). Through the magic of the same-origin policy, this technique enables inter-frame communication while still allowing the iframe to make AJAX requests back to its original domain, in this case api.foursquare.com. When we need to make API requests, we can just use the iframe’s XMLHttpRequest object from foursquare.com pages.

<iframe onload="fourSq._tempIframeCallback()"
        src="https://api.foursquare.com/xdreceiver.html">
  <html>
    <head></head>
    <body>
      <script type="text/javascript">
        document.domain='foursquare.com'
      </script>
    </body>
  </html>
</iframe>

Getting some backbone

Then came assembling the JavaScript code base around our API. To do this we’re taking advantage of jQuery, Backbone.js, Underscore.js, and the Closure Compiler.

We use the Backbone.js library to create model classes for the entities in foursquare (e.g. venues, checkins, users).  Backbone’s model classes provide a simple and lightweight mechanism to capture object data and state, complete with the semantics of classical inheritance.  As an example, fourSq.api.models.Venue is a structured JavaScript representation of the raw JSON returned by the venues API endpoint.

fourSq.api.models.Venue = fourSq.api.models.Model.extend({
  name: function() { return this.get('name'); },
  contact: function() { return this.get('contact'); },
  location: function() { return this.get('location'); },
  mayor: function() { return this.get('mayor'); },
  verified: function() { return this.get('verified'); },
  stats: function() { return this.get('stats'); },
  hereNow: function() { return this.get('hereNow'); },
  ...
});

In the code above, fourSq.api.models.Model is a simple subclass of Backbone.Model that we use to provide some common convenience functionality across our code. We also enumerate direct accessor methods for our fields as wrappers around Backbone’s regular attribute get method. We find that this technique both removes extraneous syntax from our code and serves to self-document the model schema, which in turn substantially simplifies maintenance, testing, and discovery. As an added benefit, it plays a little bit nicer with the Closure Compiler.

Instead of using Backbone.sync, we chose to write a complete service layer (fourSq.api.services) that abstracts the underlying AJAX calls and raw argument/response types of the API. The service layer allows us to enumerate the available APIs and funnel all service requests through a consistent code path. Additionally, the service layer handles translating model objects to and from raw JSON.

Here’s an example of the implementation for the badges endpoint in the UserService.  Virtually all service endpoints have the same method signature; they take a map containing arguments that will be transferred in the underlying Ajax request, as well as success and error asynchronous callbacks.

badges: function(data, success, error) {
  var deserializer = function(data) {
    var sets = data.sets || {};
    var badges = data.badges || {};

    badges = _.map(badges, function(value) {
      return new fourSq.api.models.Badge(value);
    });

    return new fourSq.api.models.AnonModel({
      "sets" : sets,
      "badges" : badges
    });
  };

  fourSq.api.services.service_(
    fourSq.apiPaths.users.badges(data.id), data, {
      success: fourSq.api.services.wrapOnSuccess_(
                 deserializer, success, error),
      error: fourSq.api.services.wrapOnError_(error)
  });
}

In the code above, the success and error callback arguments are not passed directly to the underlying Ajax layer. The callbacks are wrapped to simplify the response object types that are ultimately received by the client code. The success callback, for example, is wrapped and paired with a function that deserializes the raw JSON response from the server into Backbone models keeping the client layer fully abstracted from the raw transfer format. The deserializer makes use of another helpful technique, the fourSq.api.models.AnonModel. The AnonModel mechanism helps keep our attribute access syntax consistent even in cases where we haven’t previously declared a model class – a function accessor is created for each key. In this example, result.badges() would return an array of badges.  (Oh yeah, and we’re using Underscore.js … it’s super amazingly helpful.)

Compiler? I hardly even knew ‘er

As we continue to grow our JavaScript codebase, we want to ensure that we can iterate quickly while still developing modular, performant, and robust code. To help us reach these goals, we’ve integrated the Closure Compiler tightly into our development environment. We run it in process and compile JavaScript on demand. Our code, whether served in production or during local development, is run through compilation that catches missing properties, typos, invalid method arguments, and much much more. The end result is not only minified code, but code that has been optimized. Lastly, since not every page needs all the JavaScript, we split the compiled output into modules that can be synchronously or asynchronously loaded as necessary.

Overall we’re happy with the first few features launched using this system, and plan on continuing to incorporate it going forward. If you like hacking on JavaScript and have a keen eye for awesome user interaction, we’re hiring!

- Mike Singleton (@msingleton), Matt Kamen (@losfumato), Dolapo Falola (@dolapo)

 

Redesigning the developer website

Earlier this week we launched the redesign of developer.foursquare.com, a project I’ve worked on for the past two months as part of my internship on the platform team.

Above: our redesigned developer site

On my first day, Akshay, our platform evangelist, told me that the redesign would be my primary project and I’d pretty much be the only one working on it. At first, this was really intimidating. As a junior at the University of Pennsylvania, I have only a handful of technical experiences under my belt compared to the years and years of expertise that everyone on the platform and engineering teams bring to the table.

Understanding human mobility with machine learning and a billion check-ins

At foursquare, we believe there is a huge opportunity to apply machine learning algorithms to the collective movement patterns of millions of people and build new services which help people better understand and connect with places.

foursquare is now aware of 25 million places worldwide, each of which can be described by unique signals about who is coming to these places, when, and for how long. We employ a variety of machine learning algorithms at foursquare to distill these signals into useful data for our app and our platform.

In the slides below, we talk briefly about the data at foursquare and some interesting applications of machine learning. Enjoy!

Machine Learning and Big Data at Foursquare

- @metablake

Behind the scenes of our “week of check-ins” visualization

One of the great things about working at foursquare is having access to a huge dataset that — when viewed in aggregate — can reveal some really amazing patterns in how people use the product across the globe. To celebrate our one-billionth check-in, we wanted to create something that showed the scope and breadth of that data. In past dataviz projects, we’ve looked at the total history of foursquare and analyzed specific aspects of the check-in data. For this project, we decided to limit it to one typical week of foursquare usage and keep the visualization simple, to give people a chance to draw their own observations. What we decided on was a straightforward map, with time-lapsed checkins animating across it. We chose to color-code the venue categories to reveal a bit about the specific activity going on in different places throughout the day. Here are some of the tools used to create this visualization: Continue reading →

Slashem: A Rogue-like, type-safe Scala DSL for querying Solr

Slashem is our spiffy new Rogue-like type-safe* DSL for querying Solr. If you are curious about Rogue, we’ve talked about it in some of our previous blog posts (launch, Going Rogue, Part 2: Phantom Types). Solr is an open source full-text search platform from the Apache Lucene project which we use (not surprisingly) for searching things. Here’s a look at what some simple queries look like in Slashem:

SolrVenue where (_.default contains "club")
      useQueryType("edismax")
      phraseBoost(_.name,2.5)
SolrUser where (_.fullname contains "jon")
      boostQuery(_.friend_ids in List(110714,
                                      1048882,
                                      2775804,
                                      364701,
                                      33).map(_.toString))
      useQueryType("edismax")
SolrUser where (_.fullname contains "jon-shea")
SolrVenue where (_.name eqs "Blue Bottle")

The resulting Solr queries they make are:

/solr/select?q=(club)&defType=edismax&start=0&rows=10&pf=name^2.5&pf2=name^2.5&pf3=name^2.5
/solr/select?q=fullname:(jon)&defType=edismax&start=0&rows=10&bq=friend_ids:("110714" OR "1048882" OR "2775804" OR "364701" OR "33")
/solr/select?q=fullname:(jon\-shea)&start=0&rows=10
/solr/select?q=name:("Blue Bottle")&start=0&rows=10

As you can see Slashem also takes care of any escaping that might be necessary :)
If you want to skip the chit-chat and get right to it, you can check out our github project for slashem at https://github.com/foursquare/slashem.
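
To make the escaping concrete, here is a small Python sketch of the idea. It is not Slashem’s implementation (Slashem is written in Scala); it assumes the set of special characters that Lucene’s classic query parser escapes, and the function name is hypothetical.

```python
import re

# Characters with special meaning in Lucene query syntax (assumed set):
# \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')

def escape_term(term):
    """Prefix every special character in a user-supplied term with a backslash."""
    return _SPECIAL.sub(r'\\\1', term)

print(escape_term("jon-shea"))
```

Escaping in one place, inside the query builder, means callers never have to remember which characters are special.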

Background:

In the dark ages, before Rogue, all of our queries were written by manually crafting strings. Sure, they were lovingly handcrafted, and some would argue it’s not the same with a robot, but there were mistakes, too. A query would come out a little too large, or try to place two separate limits on the same query. Sometimes we would even query against a field that didn’t exist. Worst of all, you had to wait until runtime to find out that your precious handcrafted query just didn’t make the cut. The first to modernize were our Mongo queries with the introduction of Rogue, and with Slashem our Solr queries are now generated by an unfeeling type-safe robot as well.

Much like in Rogue, you start by making a record definition. The types are pretty closely mapped to the ones you will have in your Solr’s schema.xml. Here is a version of our Event schema:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="venueid" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="name" type="text" indexed="true" stored="true" required="false" />
<field name="tags" type="text" indexed="true" stored="false" required="false" />
<!-- event time -->
<field name="start_time" type="date" indexed="true" stored="true" required="false" />
<field name="expires_time" type="date" indexed="true" stored="true" required="false" />
<field name="lat" type="double" indexed="true" stored="true"/>
<field name="lng" type="double" indexed="true" stored="true"/>
<field name="geo_s2_cell_ids" type="text_ws" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true"/>

And our resulting model looks like:

class SolrEvent extends SolrSchema[SEvent] {
    def meta = SolrEvent

    // The default field will result in queries against the default
    // field or if a list of fields to query has been specified to
    // an edismax query then the query will be run against this.
    object default extends SolrDefaultStringField(this)

    // This is a special field to allow for querying of *:*
    object metall extends SolrStringField(this) {
        override def name="*"
    }
    object id extends SolrObjectIdField(this)
    object venueid extends SolrObjectIdField(this)
    object lat extends SolrDoubleField(this)
    object lng extends SolrDoubleField(this)
    object name extends SolrStringField(this)
    object tags extends SolrStringField(this)
    object start_time extends SolrDateTimeField(this)
    object expires_time extends SolrDateTimeField(this)
    object geo_s2_cell_ids extends SolrGeoField(this)
}

A simple query for all events containing “dj hixxy” would be:

SolrEvent where (_.name contains "dj hixxy")

We might also care about all of the events at a given venue:

SolrEvent where (_.venueid eqs
                 new ObjectId("4dc5bc4845dd2645527930a9"))

We could follow up with a query for all events between two dates:

SolrEvent where (_.start_time inRange(startTime,endTime))

Another important part of slashem, in addition to writing queries, is that the type checker verifies that queries are sensible. For example we can see that it refuses to perform a date range search on the name field:

SolrEventTest where (_.name inRange(startTime,endTime))

results in:

error: type mismatch;
found : org.joda.time.DateTime
required: String

Which helps keep us from accidentally writing bad queries :)

And while doing a geo filter query on the geo_s2_cell_ids field compiles:

SolrEventTest where (_.default contains "DJ Hixxy")
                    useQueryType("edismax")
                    filter(_.geo_s2_cell_ids inRadius(geoLat,
                                                      geoLong,
                                                      1))

Trying to run it against the tags field does not compile. Yay for the compiler :)

Specifying return fields

While the above queries have been lovely, some people actually want return data from their queries (crazy!). You can specify the fields to fetch with a simple fetchField like so:

SolrEvent where (_.default contains "pirates") fetchField(_.name)

Executing Queries

Executing your queries is simple. For a blocking request, simply take the query and then call fetch() on it. The query is then executed against one of your Solrs (most likely, you only have one, but thanks to finagle we support multiple backends). We also provide a fetchBatch method, which takes a function and applies it over the Solr results in a given batch size. If you want to perform a non-blocking request and get a future back, you can call .fetchFuture() (thanks again to finagle for making this easy :) ).

The return type is a generic SearchResults. The general information (like # of results) is in the responseHeader. From the response, you can extract your actual resulting documents in a few ways.

  1. The most sane way is to call .results on the response, yielding a list of instances of your schema class.
  2. If you just want a list of object ids which are named “id” then you can call .oids. This option exists mostly because it was a frequent and important code path for us.
  3. If you don’t want any of the record stuff getting in your way for some reason, you can call .docs and get an array of HashMaps of [String,Any]. I don’t recommend this.

Contributions (code, comments, docs) always welcome!

- @holdenkarau, @jliszka, @jonshea, @kabragovind