Extract Random Documents in MongoDB While Preserving Pagination

Because sometimes you need to grab a random document from your database

Ivano Di Gese
May 25 · 5 min read

Photo by Patrick Fore on Unsplash

It may sound weird. It may sound extreme. It may sound unconventional. But extracting random documents from a MongoDB collection is actually a common requirement that sooner or later every programmer has to implement.

First of All: Use Cases and Real Needs

After all, you really do need to be able to randomize document extraction.

Imagine a use case or an app where you have to randomize the information you show, even with a basic algorithm and not necessarily in a purely random way. Apps show random or pseudorandom data more often than you think, to cycle through content and make it always appear fresh.

Instagram itself uses an approach not much different from this to select the pictures shown in its Explore section.

These pictures are based on content you usually enjoy, there's no doubt about it, but the images you see must be rotated, shuffled, and mixed in a certain way.
So even apart from the algorithm or the logic you incorporate into this technique, your goal is often to pick random (and genuine, of course) documents from your collection and send them to your client application, which in this case is the mobile app.

Here are some scenarios that may come up:

 Showing popular posts
 Extracting random news from a specific category
 Outputting sample data
 Applying randomness to meet specific goals, for example giving your content the same impression rate and distribution
 Generating random content to randomize a behaviour, maybe for testing purposes, like input data for unit tests

The problem of pagination

Simple and very frustrating: when you pull out the first 50 documents, you want to be able to paginate the data, using $skip and $limit correctly, showing coherent data in every subsequent query and avoiding duplicate content.
If the extraction were purely random, you could obviously get the same document in separate queries, because each skipped query doesn't care about what happened before or what was already sent in the output.
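The duplicate-content bug is easy to see in a plain-JS simulation (no MongoDB needed); the 10-document collection and the page size here are invented for the demo:

```javascript
// Each "page" re-shuffles the collection independently, exactly like
// re-running a purely random query: documents can repeat across pages.
const docs = Array.from({ length: 10 }, (_, i) => ({ _id: i }));

function randomPage(pageSize, pageNumber) {
  // Fresh shuffle on every request: the second page knows nothing
  // about what the first page already returned.
  const shuffled = [...docs].sort(() => Math.random() - 0.5);
  return shuffled.slice(pageNumber * pageSize, (pageNumber + 1) * pageSize);
}

const page1 = randomPage(5, 0).map((d) => d._id);
const page2 = randomPage(5, 1).map((d) => d._id);
// page1 and page2 may overlap, because the second shuffle is
// completely unrelated to the first one.
```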

The first approach would be to remember (or cache) the content emitted in every previous output. But that's bad: there's no way to scale this approach up, and we don't want to store useless data alongside our x-billion-document collection. We want to do better. So how can we?

Seeding the query

If the output could be random but tied to a seed, a value the client keeps throughout the pagination, we'd have it. If we could randomly extract data, perhaps ordering by a specific input, we'd just need to tell the client to use the same input for every request after the first. But how?

This approach actually exists in SQL. The popular ORDER BY RANDOM() can be seeded by passing a specific value inside the function's parentheses (MySQL's RAND(123), for instance). So a query like this …

SELECT * FROM theCollection ORDER BY RANDOM(123)

… would do the trick in some RDBMSs. So what about MongoDB?

MongoDB doesn't provide any specific operator or function to randomize access to a collection other than the $sample operator.

The ‘$sample’ operator: not a solution

The $sample operator can be used inside any aggregation pipeline, as it exists as a pipeline stage.

It's a very straightforward operator: a versatile, easy-to-use way to apply randomness to an aggregation pipeline. But it's also very limited and basic. Here's how it works:

db.theCollection.aggregate(
  [ { $sample: { size: 10 } } ]
)

$sample uses a pseudorandom cursor to select documents: it basically just positions the cursor at a random spot to extract pseudorandom data. Very easy, very low complexity.

That's why it's not the solution to our problem: it can't take previous queries into account, and it provides no way to repeat the query after the first extraction while avoiding duplicate documents.

The correct way

To maintain logic that's incorporated into the query's matching criteria, the only way to meet our goal is to customize the extraction with an algorithm of our own.
So instead of using the $sample operator, we come back to the seeding approach: pass a parameter as input to our query and make it work as the seed from which the randomness of the extraction is determined.

 The client gives us a seed
 The query uses the seed to shuffle, randomize, and then project a new field
 The query then orders the results by this new value
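The three steps above can be sketched as an aggregation pipeline builder. The field name (`counter`), the hashing formula, and the prime modulus are illustrative assumptions, not a prescribed recipe:

```javascript
// Build a seeded, paginatable aggregation pipeline.
function seededPipeline(seed, pageNumber, pageSize) {
  return [
    // 1. Project a deterministic pseudorandom rank from the seed and
    //    a numeric field of the document (assumed here: "counter").
    { $addFields: { _rank: { $mod: [{ $multiply: ["$counter", seed] }, 9973] } } },
    // 2. Order by the rank (same seed => same order on every request),
    //    with _id as a tiebreaker so the ordering is total.
    { $sort: { _rank: 1, _id: 1 } },
    // 3. Plain pagination on the now-stable ordering.
    { $skip: pageNumber * pageSize },
    { $limit: pageSize },
  ];
}

// Usage against a real collection:
// db.theCollection.aggregate(seededPipeline(123, 0, 50));
```

Because the rank depends only on the seed and the document itself, $skip and $limit behave exactly as they do on any other sorted query.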

The real problem is how to relate the seed to our documents' field values. Consider these hints and suggestions:

 If the seed is a number, it can be used in a mathematical formula, which gives us good distribution and cardinality
 If the seed is a string, you could manipulate its value or somehow relate it to one of your fields
 If the seed is a date, you could calculate the time difference from a timestamp, or something like that, to randomize the new value

It's all up to you, and it depends heavily on the collection structure and the data modeling of your database.
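One way to keep the pipeline uniform across all three seed types is to normalize the seed to a number first. The function below is an illustrative assumption, not a fixed recipe:

```javascript
// Turn a numeric, string, or Date seed into a plain number that can
// feed a formula like the ones described above.
function numberFromSeed(seed) {
  if (typeof seed === "number") return seed;
  if (seed instanceof Date) {
    // Date seed: derive a number from the timestamp.
    return seed.getTime() % 1e9;
  }
  // String seed: a cheap character-code hash.
  let h = 0;
  for (const ch of String(seed)) h = (h * 31 + ch.charCodeAt(0)) % 1e9;
  return h;
}
```

The only property the seeding trick needs is determinism: the same input must always yield the same number.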

In my experience, using numbers as seeds is much easier than manipulating string formats (character occurrences, string length, etc.).

By using a number as a seed, it's easy to find a formula to $project a new field with a random value. I personally used the modulus (%) or division (/) operator with very good results, keeping complexity very low and avoiding CPU overhead.
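A plain-JS sketch shows why a modulus-based rank preserves pagination: the same seed gives the same total order, so consecutive pages never overlap. The `counter` field and the constants are invented for the demo:

```javascript
// A 20-document collection with a numeric field to hash against.
const collection = Array.from({ length: 20 }, (_, i) => ({ _id: i, counter: i * 31 }));

function rank(doc, seed) {
  return (doc.counter * seed) % 9973; // cheap, CPU-friendly formula
}

function fetchPage(seed, pageNumber, pageSize) {
  return [...collection]
    .sort((a, b) => rank(a, seed) - rank(b, seed) || a._id - b._id)
    .slice(pageNumber * pageSize, (pageNumber + 1) * pageSize)
    .map((d) => d._id);
}

const first = fetchPage(123, 0, 10);
const second = fetchPage(123, 1, 10);
// Disjoint pages that together cover the whole collection.
```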

Conclusions

Extracting random documents from a collection can be tricky, at least as tricky as the worst aggregations are to build.

But after all, keeping things simple, obtaining a seed, and then $projecting a new field (the one you'll order by) can be a good idea if you find an easy way to calculate that field's value from the seed value.

And obviously, the seed must be refreshed every time you want to re-randomize your data. That's why it can be a good idea to create a new seed value every time the first page of results is requested; after that, every subsequent page uses the same seed value.
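That seed lifecycle can be sketched from the client's side; the function and parameter names are hypothetical:

```javascript
// Mint a seed for the first page, then reuse it for every
// subsequent page of the same result set.
function newSeed() {
  return Math.floor(Math.random() * 1e6) + 1;
}

function requestParams(pageNumber, currentSeed) {
  const seed = pageNumber === 0 ? newSeed() : currentSeed;
  return { page: pageNumber, seed };
}

// const first = requestParams(0, null);       // fresh seed
// const next  = requestParams(1, first.seed); // same seed, next page
```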
Thanks to Zack Shapiro. 

Programming · MongoDB · Random · Database · NoSQL

WRITTEN BY

Ivano Di Gese
Passionate IT skills on the run: keep calm, do your stuff and code
better

Better Programming
Advice for programmers.
