-
Notifications
You must be signed in to change notification settings - Fork 20
Add polars notebook #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Reviewer's Guide by SourceryThis pull request adds a new section for Polars, a high-performance alternative to pandas, to the project. It includes updates to the main README.md file and introduces a new README.md file for the Polars section, explaining its purpose and contents. File-Level Changes
Tips
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @gjbex - I've reviewed your changes - here's some feedback:
Overall Comments:
- There's a typo in the main README.md file. 'Kllustrations' should be 'Illustrations' in the Polars entry.
Here's what I looked at during the review
- 🟡 General issues: 5 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟡 Documentation: 1 issue found
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
"for patient in range(1, 10):\n", | ||
" plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n", | ||
" label=f'patient {patient}')\n", | ||
"plt.legend();" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Consider using a dynamic range for patient iteration.
The loop iterates over a hardcoded range from 1 to 10. Consider using a dynamic range based on the actual number of patients in the dataset to make the code more flexible.
"for patient in range(1, 10):\n", | |
" plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n", | |
" label=f'patient {patient}')\n", | |
"plt.legend();" | |
num_patients = len([col for col in time_series.columns if col.startswith('temperature_')]) | |
for patient in range(1, num_patients + 1): | |
plt.plot(time_series['date'], time_series[f'temperature_{patient}'], | |
label=f'patient {patient}') | |
plt.legend() |
"for patient in range(1, 10):\n", | ||
" plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n", | ||
" label=f'patient {patient}')\n", | ||
"plt.legend();" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Check for key existence in dictionary access.
When accessing dictionary keys using formatted strings, ensure that the keys exist to avoid potential KeyErrors. Consider using a safer method like dict.get()
with a default value.
"plt.legend();" | |
for patient in range(1, 10): | |
plt.plot(time_series['date'], time_series.get(f'temperature_{patient}', []), | |
label=f'patient {patient}') | |
plt.legend() |
} | ||
], | ||
"source": [ | ||
"time_series.select(pl.col('date'), pl.any_horizontal(pl.all().is_null()).alias('has_null'))" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (performance): Optimize null check to specific columns.
Checking for nulls across all columns might be inefficient for large datasets. Consider checking specific columns that are critical for your analysis to improve performance.
time_series.select(
pl.col('date'),
pl.any_horizontal(
pl.col(['critical_column1', 'critical_column2', 'critical_column3']).is_null()
).alias('has_null')
)
} | ||
], | ||
"source": [ | ||
"time_series.filter(pl.col('date') == datetime.strptime('2012-10-02 13:00:00', '%Y-%m-%d %H:%M:%S'))" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (performance): Optimize date filtering by converting column type.
If date filtering is a frequent operation, consider converting the date column to a datetime type once and using direct comparisons to improve performance.
"time_series.filter(pl.col('date') == datetime.strptime('2012-10-02 13:00:00', '%Y-%m-%d %H:%M:%S'))" | |
time_series = time_series.with_column(pl.col('date').cast(pl.Datetime)) | |
filtered_series = time_series.filter(pl.col('date') == pl.datetime(2012, 10, 2, 13, 0, 0)) |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"From the plots, it seems reasonable to do a linear interpolation for both missing data points." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Specify interpolation method and handle edge cases.
The suggestion to use linear interpolation is vague. Specify the method of interpolation and consider handling edge cases where interpolation might not be appropriate, such as at the boundaries of the dataset.
"From the plots, it seems reasonable to do a linear interpolation for both missing data points." | |
"From the plots, it seems reasonable to use linear interpolation for the missing data points in the middle of the dataset. For edge cases, such as missing values at the beginning or end of the time series, we should consider using forward or backward fill methods respectively. We'll need to carefully examine these boundary cases before applying the interpolation." |
@@ -14,6 +14,7 @@ to create it. There is some material not covered in the presentation as well. | |||
* [`networkx`](networkx): illustration of using the networkx library for graph | |||
representation and algorithms. | |||
* [`pandas`](pandas): illustrations of using pandas and seaborn. | |||
* [`polars`](polars): Kllustrations of using polars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (documentation): Fix typo in 'Kllustrations'
The word 'Kllustrations' should be 'Illustrations'.
Summary by Sourcery
Add a new Polars section to the project, including a README and a Jupyter notebook that explores functional differences between pandas and Polars, replicating the existing pandas notebook.
New Features:
Documentation: