Improve QuerySet performance by removing limit on server-side chunking #347


Merged
merged 1 commit into main from remove-get-default-chunk-size
Jul 26, 2025

Conversation

@Jibola Jibola commented Jul 24, 2025

Summary

We noticed large dips in performance when attempting to retrieve many documents (n > 10_000) for any query. Digging into our logic, we found the cause was our call to cursor.batch_size(100), which limited each database fetch to at most 100 documents. Fetching 10_000+ documents would therefore generate at least 100 individual queries to the database.

We do not need to configure batch sizing, as PyMongo already handles performant chunking via its cursor iterator. This PR removes all batching on the iterator and prevents end-user configuration of batch sizing. This significantly reduces the number of requests and leaves it to the server to determine how many documents to return per batch.
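
To illustrate at the driver level, here is a minimal sketch of why the capped batch size was expensive (the client, database, and collection names below are placeholders, not project code):

from pymongo import MongoClient

client = MongoClient()  # placeholder connection; assumes a local MongoDB
coll = client.example_db.authors

# Old behavior: capping the batch size forces one round trip per 100
# documents, so retrieving 10_000 documents costs ~100 getMore requests.
capped = coll.find({})
capped.batch_size(100)

# New behavior: with no batch_size() call, the server decides how many
# documents each batch contains (bounded by the 16MB message limit), so
# large result sets need far fewer round trips.
for doc in coll.find({}):
    pass  # PyMongo fetches subsequent batches transparently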

Changes in this PR

  • Removed GET_ITERATOR_CHUNK_SIZE import and usage from the SQL compiler
  • Yield the _make_result() output for each row (document) directly from the cursor object (see the sketch below)
  • Updated documentation to reflect the change in cursor iteration behavior
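
As a rough illustration of the change, here is a minimal sketch of the before and after (simplified, not the actual compiler source; cursor_iter and _make_result are the names touched by this PR):

# Before: a fixed chunk size capped each server fetch and grouped rows
# client-side before yielding them.
def cursor_iter(self, cursor, chunk_size, columns):
    cursor.batch_size(chunk_size)  # at most chunk_size docs per getMore
    chunk = []
    for row in cursor:
        chunk.append(self._make_result(row, columns))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    yield chunk

# After: no batch_size() call; the driver and server choose the batch
# size, and each row is yielded as soon as it arrives.
def cursor_iter(self, cursor, _, columns):
    for row in cursor:
        yield [self._make_result(row, columns)]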

Screenshots

>>> import time
>>> def timer(func, *args, **kwargs):
...      """Time a function call and return (result, execution_time in SECONDS)"""  
...      start = time.perf_counter()  
...      result = func(*args, **kwargs)  
...      end = time.perf_counter()  
...      exec_time = end - start  
...      return result, exec_time  
... 

Before

>>> timer(lambda: Author.objects.filter())
(<MongoQuerySet [<Author: >, *REMOVED FOR EASY READING*, '...(remaining elements truncated)...']>, 0.0006403750012395903)
>>> timer(lambda: list(Author.objects.filter()[:101])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 0.04128820799815003)
>>> timer(lambda: list(Author.objects.filter()[:10_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 1.9592289169959258)
>>> timer(lambda: list(Author.objects.filter()[:100_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 20.37065891599923)
>>> timer(lambda: list(Author.objects.filter()[:1_000_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 20.618212166999)

# JOIN LOGIC
>>> timer(lambda: list(Author.objects.filter(address__city="London")[:1_000_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 34.79546758400102)

After

>>> import time
>>> timer(lambda: Author.objects.filter())
(<MongoQuerySet [<Author: >, *REMOVED FOR EASY READING* '...(remaining elements truncated)...']>, 0.0009210830030497164)
>>> timer(lambda: list(Author.objects.filter()[:101])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 0.04708974999812199)
>>> timer(lambda: list(Author.objects.filter()[:10_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 0.23040095800388372)
>>> timer(lambda: list(Author.objects.filter()[:100_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 1.136685874997056)
>>> timer(lambda: list(Author.objects.filter()[:1_000_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 1.4290711669964367)
>>> 

# JOIN LOGIC
>>> timer(lambda: list(Author.objects.filter(address__city="London")[:1_000_001])[0])
(<Author: Author object (6877c9c7d04f764442793292)>, 14.903681124997092)

@Copilot Copilot AI left a comment

Pull Request Overview

This PR removes the use of Django's GET_ITERATOR_CHUNK_SIZE constant in favor of MongoDB's native document fetching policy. The change allows MongoDB to handle cursor batch sizing according to its own optimization strategies rather than forcing a fixed chunk size.

Key changes:

  • Removed GET_ITERATOR_CHUNK_SIZE import and usage from the SQL compiler
  • Implemented adaptive chunk sizing that starts with MongoDB's default and adjusts based on document size
  • Updated documentation to reflect the change in cursor iteration behavior

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

  • django_mongodb_backend/compiler.py: Updated execute_sql and cursor_iter methods to remove fixed chunk sizing and implement adaptive batching
  • django_mongodb_backend/query.py: Minor formatting change to a method call
  • docs/source/releases/5.2.x.rst: Added release note documenting the removal of GET_ITERATOR_CHUNK_SIZE
  • docs/source/releases/5.1.x.rst: Added release note documenting the removal of GET_ITERATOR_CHUNK_SIZE
Comments suppressed due to low confidence (1)

django_mongodb_backend/compiler.py:324

  • Using underscore '_' for the chunk_size parameter is unclear. Consider using a more descriptive name like 'unused_chunk_size' or remove the parameter entirely if it's not needed.
    def cursor_iter(self, cursor, _, columns):

@timgraham timgraham left a comment

Is there some context for this PR to better understand the desired behavior and the reasoning behind it?

@@ -11,6 +11,10 @@ Django MongoDB Backend 5.1.x
the ``base_field`` uses a database converter.
- Fixed ``RecursionError`` when using ``Trunc`` database functions on non-MongoDB
databases.
- Removed ``GET_ITERATOR_CHUNK_SIZE`` as default from
:meth:`SQLCompiler.execute_sql() <django_mongodb_backend.compiler.SQLCompiler.execute_sql>`.
Collaborator

SQLCompiler is a private API and generally not mentioned in the docs, thus linking to it using ":meth:" won't work. You should instead try to describe the practical effect of this change at a higher level.

Also, add a section for QuerySet.iterator() after explain() and explain how chunk_size differs from the usual Django behavior.

Contributor Author

Added. Let me know if it's clearer.

Jibola commented Jul 24, 2025

> Is there some context for this PR to better understand the desired behavior and the reasoning behind it?

Yep. I'll attach all of the context shortly.

@Jibola Jibola changed the title from "Remove use of GET_ITERATOR_CHUNK_SIZE in favor of MongoDB native document fetching policy" to "Ignore chunk_size usage in favor of MongoDB's native cursor iteration" Jul 24, 2025
-            yield chunk
-            chunk = []
-        yield chunk
+        yield [self._make_result(row, columns)]
@WaVEV WaVEV Jul 24, 2025

🤔 Trying to understand: with the change made at line 261, removing cursor.batch_size(chunk_size) should be sufficient, right? Is this part only for consistency?

I mean: removing line 261 is the fix.

Collaborator

I think what Emanuel is getting at is that chunk size can be used in two ways, server-side and client-side. I believe cursor.batch_size(chunk_size) is about the server, while this method (among others) controls how the rows that the server returns are turned into model instances. I'm not sure if we want to or need to change the behavior client-side, and I'm not certain whether this change is sufficient for that, but I have to study it more.
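
A hypothetical illustration of that distinction (the helper below is invented for this example, not project code):

from itertools import islice

def client_side_chunks(cursor, chunk_size):
    # Client-side chunking: group rows the driver has *already* fetched
    # into fixed-size lists before converting them to model instances.
    it = iter(cursor)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

# Server-side chunking is independent of the above: cursor.batch_size(n)
# tells the server how many documents to send per getMore, regardless of
# how the client groups rows afterwards.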

Contributor Author

I'm fine with deferring the changes to cursor_iter since this part is client-side model conversion.

This does result in more yields, meaning more trips up and down the stack, but we can experiment with how that affects performance in a different PR. (I imagine this is not as major.)

The benchmark suite may be the best proxy for deciphering that. If that's the case, we may stand to benefit from an optimization that loosely estimates the number of documents already fetched and then loads them into models. Once they exist on the cursor, they might as well get flushed!

Collaborator

In the PR description you wrote, "PyMongo already handles its own performant chunking policy." But I'm assuming that MongoDB uses server-side cursors, so it's the server, not PyMongo, that does the chunking of up to 16MB of data?

So if we have no chunking in cursor_iter(), we could have some 16MB of model instances instantiated at a time. This seems like it would be quite a bit different from QuerySet.iterator()'s default behavior of chunk_size=2000, which, according to the documentation for QuerySet.iterator():

> Assuming rows of 10-20 columns with a mix of textual and numeric data, 2000 is going to fetch less than 100KB of data, which seems a good compromise between the number of rows transferred and the data discarded if the loop is exited early.

I might have some details wrong here, but at least we can clarify the desired behavior before confirming that the implementation matches it.
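
For reference, the standard Django API being compared against (the model name is illustrative):

# On SQL backends, chunk_size bounds how many rows the driver fetches at
# a time, so memory use stays roughly constant during iteration.
for author in Author.objects.iterator(chunk_size=2000):
    ...  # process one instance at a time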

Member

I'd agree with Emanuel and Tim. The major perf issue seems like it should be resolved by just the cursor.batch_size(chunk_size) change, right? If so, I'd suggest deferring the cursor_iter changes to a future ticket (if at all).

@Jibola Jibola Jul 25, 2025

Sounds good to me. I'll revert the function.
Funny enough, I just finished running the perf analysis on it. We'll probably fold it in, still (just not for this PR).

Here are both results. It looks like just iterating on the cursor without client-side chunking gives slightly better performance than client-side chunking. However, this doesn't factor in more nuanced iterator use.

no_chunking.json
with_chunking.json

@Jibola Jibola requested review from timgraham, WaVEV and NoahStapp July 25, 2025 19:51
@timgraham timgraham changed the title from "Ignore chunk_size usage in favor of MongoDB's native cursor iteration" to "Improve QuerySet performance by removing limit on server-side chunking" Jul 25, 2025
@timgraham timgraham force-pushed the remove-get-default-chunk-size branch from c65b7e5 to defec37 Compare July 25, 2025 20:01
@Jibola Jibola dismissed NoahStapp’s stale review July 26, 2025 01:48

issues addressed and client-side chunking will be deferred to a later PR

@Jibola Jibola merged commit 8077041 into main Jul 26, 2025
17 checks passed
@timgraham timgraham deleted the remove-get-default-chunk-size branch July 26, 2025 13:57
timgraham pushed a commit to timgraham/django-mongodb-backend that referenced this pull request Jul 28, 2025