Improve QuerySet performance by removing limit on server-side chunking #347
Conversation
Pull Request Overview
This PR removes the use of Django's `GET_ITERATOR_CHUNK_SIZE` constant in favor of MongoDB's native document fetching policy. The change allows MongoDB to handle cursor batch sizing according to its own optimization strategies rather than forcing a fixed chunk size.
Key changes:
- Removed the `GET_ITERATOR_CHUNK_SIZE` import and its usage from the SQL compiler
- Implemented adaptive chunk sizing that starts with MongoDB's default and adjusts based on document size
- Updated documentation to reflect the change in cursor iteration behavior
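To make the described change concrete, here is a rough sketch at the PyMongo level rather than the backend's actual compiler code; the connection details and the `test.books` collection are assumptions for illustration only:

```python
# A rough sketch of the change, not the actual diff.
from pymongo import MongoClient

client = MongoClient()  # assumed local connection
collection = client.test.books  # hypothetical collection

# Old behavior: a fixed chunk size forced one round trip per 100 documents
# (Django's GET_ITERATOR_CHUNK_SIZE is 100).
cursor = collection.find().batch_size(100)

# New behavior: with no batch_size() call, the server applies its own
# policy -- typically a first batch of 101 documents, then batches of
# up to 16MB -- so large result sets need far fewer round trips.
cursor = collection.find()
for document in cursor:
    pass
```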
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `django_mongodb_backend/compiler.py` | Updated `execute_sql` and `cursor_iter` methods to remove fixed chunk sizing and implement adaptive batching |
| `django_mongodb_backend/query.py` | Minor formatting change to a method call |
| `docs/source/releases/5.2.x.rst` | Added release note documenting the removal of `GET_ITERATOR_CHUNK_SIZE` |
| `docs/source/releases/5.1.x.rst` | Added release note documenting the removal of `GET_ITERATOR_CHUNK_SIZE` |
Comments suppressed due to low confidence (1)

`django_mongodb_backend/compiler.py:324`

> Using underscore `_` for the chunk_size parameter is unclear. Consider using a more descriptive name like `unused_chunk_size`, or remove the parameter entirely if it's not needed.

```python
def cursor_iter(self, cursor, _, columns):
```
Is there some context for this PR to better understand the desired behavior and the reasoning behind it?
`docs/source/releases/5.1.x.rst` (outdated)

```diff
@@ -11,6 +11,10 @@ Django MongoDB Backend 5.1.x
   the ``base_field`` uses a database converter.
 - Fixed ``RecursionError`` when using ``Trunc`` database functions on non-MongoDB
   databases.
+- Removed ``GET_ITERATOR_CHUNK_SIZE`` as default from
+  :meth:`SQLCompiler.execute_sql() <django_mongodb_backend.compiler.SQLCompiler.execute_sql>`.
```
`SQLCompiler` is a private API and generally not mentioned in the docs, so linking to it using `:meth:` won't work. You should instead try to describe the practical effect of this change at a higher level.

Also, add a section for `QuerySet.iterator()` after `explain()` and explain how `chunk_size` differs from the usual Django behavior.
Added. Let me know if it's clearer.

Yep. I'll attach all of the context shortly.
`django_mongodb_backend/compiler.py` (outdated)

```diff
-                yield chunk
-                chunk = []
-        yield chunk
+        yield [self._make_result(row, columns)]
```
🤔 Trying to understand: the change made at line 261, `cursor.batch_size(chunk_size)`, should be sufficient on its own, right? Is this change only for consistency?

I mean: removing line 261 is the fix.
I think what Emanuel is getting at is that chunk size can be used in two ways: server-side and client-side. I believe `cursor.batch_size(chunk_size)` is about the server, while this method (among others) controls how the rows that the server returns are turned into model instances. I'm not sure if we want to or need to change the behavior client-side, and I'm not certain whether this change is sufficient for that, but I have to study it more.
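A simplified sketch of the two layers being described here; this is not the backend's actual code, and `_make_result` below is a stand-in for the compiler's document-to-row conversion:

```python
def _make_result(row, columns):
    # Stand-in for the compiler's document-to-row conversion.
    return tuple(row.get(name) for name in columns)

def cursor_iter(cursor, chunk_size, columns):
    # Server-side: batch_size() only hints how many documents each network
    # round trip may return. It changes wire traffic, not what we yield.
    cursor.batch_size(chunk_size)

    # Client-side: this loop decides how many converted rows accumulate
    # before being handed up the stack to become model instances.
    chunk = []
    for row in cursor:
        chunk.append(_make_result(row, columns))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```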
I'm fine with deferring the changes to `cursor_iter`, since this part is client-side model conversion.

This does result in more yields, meaning more trips up and down the stack, but we can play around with how that affects performance in a different PR. (I imagine this is not as major.)

The benchmark suite may be the best proxy for deciphering that. If that's the case, we may stand to benefit from the optimization of loosely calculating the number of documents that could have been fetched and then loading them into models. Once they exist on the cursor, they might as well get flushed!
In the PR description you wrote, "PyMongo already handles its own performant chunking policy." But I'm assuming that MongoDB uses server-side cursors, so it's the server, not PyMongo, that does the chunking of up to 16MB of data?

So if we have no chunking in `cursor_iter()`, we could have some 16MB of model instances instantiated at a time. That seems quite different from `QuerySet.iterator()`'s default behavior of `chunk_size=2000`, which according to the documentation for `QuerySet.iterator()`:

> Assuming rows of 10-20 columns with a mix of textual and numeric data, 2000 is going to fetch less than 100KB of data, which seems a good compromise between the number of rows transferred and the data discarded if the loop is exited early.

I might have some details wrong here, but at least we can clarify the desired behavior before confirming that the implementation matches it.
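A sketch of the behavior being contrasted, where `Book` and `handle` are hypothetical names used only for illustration:

```python
# On most backends, iterator(chunk_size=2000) bounds both the rows fetched
# per round trip and the model instances materialized at a time.
for book in Book.objects.iterator(chunk_size=2000):
    handle(book)

# The concern raised above: if the MongoDB server batches by size (up to
# ~16MB) and cursor_iter() does no client-side chunking, a single batch
# could be materialized into model instances all at once, regardless of
# the chunk_size the caller passed.
```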
I'd agree with Emanuel and Tim. The major perf issue seems like it should be resolved by just the `cursor.batch_size(chunk_size)` change, right? If so, I'd suggest deferring the `cursor_iter` changes to a future ticket (if at all).
Sounds good to me. I'll revert the function.

Funny enough, I just finished running the perf analysis on it. We'll probably still fold it in (just not in this PR).

Here are both results. It looks like just iterating on the cursor without client-side chunking performs slightly better than with client-side chunking. However, this doesn't factor in more nuanced iterator use.
Force-pushed from c65b7e5 to defec37 ("issues addressed and client-side chunking will be deferred to a later PR")
Summary
We noticed large dips in performance when attempting to retrieve many documents (n > 10_000) for any query. Digging into our logic, we found the cause in our setting of `cursor.batch_size(100)`, which limits each database fetch to at most 100 documents. With 10_000+ documents to fetch, that generates 100 or more individual queries to the database.

We do not need to configure batch sizing, as PyMongo already handles its own performant chunking policy via the cursor iterator. This PR removes any batching on the iterator and prevents end-user configuration of batch sizing. This significantly reduces the number of requests and leaves it to the server to determine how many documents to return.
Changes in this PR

- `cursor_iter` now yields the `_make_result` output of each "row" (document) from the cursor object (see the sketch below).
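A minimal sketch of that change; the real method lives in `django_mongodb_backend/compiler.py` and takes `self` plus other arguments, and `make_result` here stands in for the compiler's `_make_result` helper:

```python
def make_result(row, columns):
    # Stand-in for the compiler's document-to-row conversion.
    return tuple(row.get(name) for name in columns)

def cursor_iter(cursor, columns):
    # No client-side accumulation: each document coming off the cursor is
    # converted immediately and yielded as a one-element list.
    for row in cursor:
        yield [make_result(row, columns)]
```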
Screenshots: before and after performance screenshots (images not reproduced here).