Skip to content

Add polars notebook #19

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 3, 2024
Merged

Add polars notebook #19

merged 6 commits into from
Sep 3, 2024

Conversation

gjbex
Copy link
Owner

@gjbex gjbex commented Sep 3, 2024

Summary by Sourcery

Add a new Polars section to the project, including a README and a Jupyter notebook that explores functional differences between pandas and Polars, replicating the existing pandas notebook.

New Features:

  • Introduce a new section in the project for Polars, an alternative to pandas, with a focus on performance improvements.

Documentation:

  • Add documentation for the new Polars section, including a README that explains its purpose and contents.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

sourcery-ai bot commented Sep 3, 2024

Reviewer's Guide by Sourcery

This pull request adds a new section for Polars, a high-performance alternative to pandas, to the project. It includes updates to the main README.md file and introduces a new README.md file for the Polars section, explaining its purpose and contents.

File-Level Changes

Change Details Files
Added Polars section to the main project structure
  • Included Polars in the list of available libraries/tools
  • Added a brief description of Polars as an illustration
source-code/README.md
Created a new README for the Polars section
  • Introduced Polars as an alternative to pandas with better performance
  • Described the contents of the Polars section, including a Jupyter notebook and data directory
  • Mentioned that the notebook replicates an existing pandas notebook for comparison
source-code/polars/README.md

Tips
  • Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
  • Continue your discussion with Sourcery by replying directly to review comments.
  • You can change your review settings at any time by accessing your dashboard:
    • Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
    • Change the review language;
  • You can always contact us if you have any questions or feedback.

@gjbex gjbex merged commit 5a7f3ae into master Sep 3, 2024
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @gjbex - I've reviewed your changes - here's some feedback:

Overall Comments:

  • There's a typo in the main README.md file. 'Kllustrations' should be 'Illustrations' in the Polars entry.
Here's what I looked at during the review
  • 🟡 General issues: 5 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

Comment on lines +565 to +568
"for patient in range(1, 10):\n",
" plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n",
" label=f'patient {patient}')\n",
"plt.legend();"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Consider using a dynamic range for patient iteration.

The loop iterates over a hardcoded range from 1 to 10. Consider using a dynamic range based on the actual number of patients in the dataset to make the code more flexible.

Suggested change
"for patient in range(1, 10):\n",
" plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n",
" label=f'patient {patient}')\n",
"plt.legend();"
num_patients = len([col for col in time_series.columns if col.startswith('temperature_')])
for patient in range(1, num_patients + 1):
plt.plot(time_series['date'], time_series[f'temperature_{patient}'],
label=f'patient {patient}')
plt.legend()

"for patient in range(1, 10):\n",
" plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n",
" label=f'patient {patient}')\n",
"plt.legend();"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Check for key existence in dictionary access.

When accessing dictionary keys using formatted strings, ensure that the keys exist to avoid potential KeyErrors. Consider using a safer method like dict.get() with a default value.

Suggested change
"plt.legend();"
for patient in range(1, 10):
plt.plot(time_series['date'], time_series.get(f'temperature_{patient}', []),
label=f'patient {patient}')
plt.legend()

}
],
"source": [
"time_series.select(pl.col('date'), pl.any_horizontal(pl.all().is_null()).alias('has_null'))"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (performance): Optimize null check to specific columns.

Checking for nulls across all columns might be inefficient for large datasets. Consider checking specific columns that are critical for your analysis to improve performance.

time_series.select(
    pl.col('date'),
    pl.any_horizontal(
        pl.col(['critical_column1', 'critical_column2', 'critical_column3']).is_null()
    ).alias('has_null')
)

}
],
"source": [
"time_series.filter(pl.col('date') == datetime.strptime('2012-10-02 13:00:00', '%Y-%m-%d %H:%M:%S'))"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (performance): Optimize date filtering by converting column type.

If date filtering is a frequent operation, consider converting the date column to a datetime type once and using direct comparisons to improve performance.

Suggested change
"time_series.filter(pl.col('date') == datetime.strptime('2012-10-02 13:00:00', '%Y-%m-%d %H:%M:%S'))"
time_series = time_series.with_column(pl.col('date').cast(pl.Datetime))
filtered_series = time_series.filter(pl.col('date') == pl.datetime(2012, 10, 2, 13, 0, 0))

"cell_type": "markdown",
"metadata": {},
"source": [
"From the plots, it seems reasonable to do a linear interpolation for both missing data points."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Specify interpolation method and handle edge cases.

The suggestion to use linear interpolation is vague. Specify the method of interpolation and consider handling edge cases where interpolation might not be appropriate, such as at the boundaries of the dataset.

Suggested change
"From the plots, it seems reasonable to do a linear interpolation for both missing data points."
"From the plots, it seems reasonable to use linear interpolation for the missing data points in the middle of the dataset. For edge cases, such as missing values at the beginning or end of the time series, we should consider using forward or backward fill methods respectively. We'll need to carefully examine these boundary cases before applying the interpolation."

@@ -14,6 +14,7 @@ to create it. There is some material not covered in the presentation as well.
* [`networkx`](networkx): illustration of using the networkx library for graph
representation and algorithms.
* [`pandas`](pandas): illustrations of using pandas and seaborn.
* [`polars`](polars): Kllustrations of using polars.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (documentation): Fix typo in 'Kllustrations'

The word 'Kllustrations' should be 'Illustrations'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant