-
Notifications
You must be signed in to change notification settings - Fork 20
Add polars #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add polars #20
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Reviewer's Guide by SourceryThis pull request adds support for the Polars library, introducing new files for performance benchmarking and comparison with Pandas. The changes include new Jupyter notebooks, Python scripts for data generation, and updates to existing README files. File-Level Changes
Tips
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @gjbex - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Hard-coded email address found in SLURM script. (link)
Overall Comments:
- Consider consolidating the CSV generation scripts in pandas and polars directories into a shared utility to avoid duplication.
- It would be beneficial to add instructions on how to run the benchmarks in a controlled environment to ensure reproducibility.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🔴 Security: 1 blocking issue
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟡 Documentation: 1 issue found
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
import sys | ||
|
||
|
||
def generate_csv_file(args): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Enhance input validation and error handling
Add more comprehensive input validation for the function arguments, and include error handling for potential issues like file writing errors. This will make the script more robust and user-friendly.
def generate_csv_file(args): | |
def generate_csv_file(args): | |
if not isinstance(args, dict): | |
raise ValueError("args must be a dictionary") | |
required_keys = ['output_file', 'num_rows', 'columns'] | |
if not all(key in args for key in required_keys): | |
raise ValueError(f"Missing required arguments: {', '.join(required_keys)}") | |
try: | |
with open(args['output_file'], 'w', newline='') as csvfile: | |
# Rest of the function implementation |
1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the | ||
performance of polars and pandas on large data sets. | ||
1. `create_csv_data.py`: Python script to generate one or more large CSV files | ||
for benchmarking. | ||
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a | ||
cluster. | ||
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance | ||
of polars and pandas on a variety of operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (documentation): Consider adjusting indentation for consistency
The indentation of the new entries doesn't match the existing ones. Consider aligning them for better readability and consistency.
1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the | |
performance of polars and pandas on large data sets. | |
1. `create_csv_data.py`: Python script to generate one or more large CSV files | |
for benchmarking. | |
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a | |
cluster. | |
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance | |
of polars and pandas on a variety of operations. | |
1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the | |
performance of polars and pandas on large data sets. | |
1. `create_csv_data.py`: Python script to generate one or more large CSV files | |
for benchmarking. | |
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a | |
cluster. | |
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance | |
of polars and pandas on a variety of operations. |
#SBATCH --cpus-per-task=1 | ||
#SBATCH --mem=2G | ||
#SBATCH --time=01:00:00 | ||
#SBATCH --mail-user=geertjan.bex@uhasselt.be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚨 issue (security): Hard-coded email address found in SLURM script.
Consider using an environment variable or configuration file to manage email addresses to avoid exposing personal information in the codebase.
|
||
def generate_csv_file(args): | ||
# Set end-of-line character | ||
if args.file_type == 'Windows': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Simplify conditional into switch-like form [×3] (switch
)
fieldnames = ['timestamp'] | ||
fieldnames.extend([f'C{i + 1:d}' for i in range(len(curr_vals))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Merge extend into list declaration (merge-list-extend
)
fieldnames = ['timestamp'] | |
fieldnames.extend([f'C{i + 1:d}' for i in range(len(curr_vals))]) | |
fieldnames = ['timestamp', *[f'C{i + 1:d}' for i in range(len(curr_vals))]] |
As polars is becoming popular as an alternative to pandas, it is useful to understand its strengths and weaknesses.
Summary by Sourcery
Add new resources for comparing Polars and Pandas, including performance and functional differences. Introduce scripts for generating large datasets and running benchmarks, and update documentation to reflect these additions.
New Features:
Enhancements:
Documentation: