Skip to content

[SPARK-53098][SQL] DeduplicateRelations shouldn't remap expressions if old ExprId still exists in output #51812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mihailotim-db
Copy link
Contributor

@mihailotim-db mihailotim-db commented Aug 4, 2025

What changes were proposed in this pull request?

DeduplicateRelations shouldn't remap expressions from oldExprId -> newExprId if oldExprId still exists in output of one of the children.

Why are the changes needed?

This is needed in order to implement ExprId comparison framework for single-pass analyzer: https://issues.apache.org/jira/browse/SPARK-53099.

For example, if we have a dataframe self join plan like:

Join(a#1 = a#1)
+-Project(a#1, b#2)
+-Project(a#1, b#2)

Current implementation of DeduplicateRelations would regenerate the right hand side to a#3, b#4 and would then also remap the join condition to a#3 = a#3. However, this behavior can't be matched in single-pass because there is no actual need to remap the join condition. This PR proposes a change to only remap the condition if old ExprId is no longer present in any of the child outputs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test to DataframeSelfJoinSuite

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 4, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/fix_dedup_relations branch from f196e67 to 0482f79 Compare August 4, 2025 14:17
@github-actions github-actions bot added the BUILD label Aug 4, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/fix_dedup_relations branch from 0482f79 to e49d166 Compare August 4, 2025 14:18
@github-actions github-actions bot removed the BUILD label Aug 4, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/fix_dedup_relations branch 2 times, most recently from a9d6fad to 66cd5cd Compare August 4, 2025 15:59
@mihailotim-db mihailotim-db changed the title fix [SPARK-53098][SQL] DeduplicateRelations shouldn't remap expressions if old ExprId still exists in output Aug 4, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/fix_dedup_relations branch from 66cd5cd to 534c3bf Compare August 4, 2025 16:51
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/fix_dedup_relations branch from 534c3bf to e7c70aa Compare August 4, 2025 17:01
@violetnspct
Copy link

@mihailotim-db Should you be adding tests to cover the following scenarios?

  1. Mixed scenario with some ExprIds needing remapping and others preserved
  2. Edge case:
    • Multiple children with overlapping ExprIds. Important because it tests the AttributeSet merging logic. Should add test case with 3+ way joins having overlapping ExprIds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants