Skip to content

Add polars #20

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Sep 11, 2024
Merged

Add polars #20

merged 10 commits into from
Sep 11, 2024

Conversation

gjbex
Copy link
Owner

@gjbex gjbex commented Sep 11, 2024

As polars is becoming popular as an alternative to pandas, it is useful to understand its strengths and weaknesses.

  • Feature comparison on patient data example.
  • Performance comparison

Summary by Sourcery

Add new resources for comparing Polars and Pandas, including performance and functional differences. Introduce scripts for generating large datasets and running benchmarks, and update documentation to reflect these additions.

New Features:

  • Introduce a Jupyter notebook for performance comparison between Polars and Pandas on large datasets.
  • Add a Python script to generate large CSV files for benchmarking purposes.
  • Include a Slurm script to facilitate running the CSV generation script on a cluster.

Enhancements:

  • Add a Jupyter notebook to explore functional differences between Polars and Pandas using patient data.

Documentation:

  • Update README files to include information about new notebooks and scripts related to Polars and Pandas.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

sourcery-ai bot commented Sep 11, 2024

Reviewer's Guide by Sourcery

This pull request adds support for the Polars library, introducing new files for performance benchmarking and comparison with Pandas. The changes include new Jupyter notebooks, Python scripts for data generation, and updates to existing README files.

File-Level Changes

Change Details Files
Added Polars performance benchmarking notebooks and scripts
  • Created polars_performance.ipynb to compare Polars and Pandas performance
  • Added polars_large_data_benchmark.ipynb for benchmarking on large datasets
  • Implemented create_csv_data.py script to generate large CSV files for benchmarking
  • Added create_csv_data.slurm script for running data generation on a cluster
source-code/polars/README.md
source-code/polars/polars_performance.ipynb
source-code/polars/polars_large_data_benchmark.ipynb
source-code/polars/create_csv_data.py
Updated README files with new content descriptions
  • Added descriptions for new Polars-related files in polars/README.md
  • Updated pandas/README.md with information about generate_csv_files.py
source-code/polars/README.md
source-code/pandas/README.md
Added CSV file generation script for Pandas
  • Implemented generate_csv_files.py script for creating CSV files with various options
  • Added support for different column types, file formats, and customization options
source-code/pandas/generate_csv_files.py

Tips
  • Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
  • Continue your discussion with Sourcery by replying directly to review comments.
  • You can change your review settings at any time by accessing your dashboard:
    • Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
    • Change the review language;
  • You can always contact us if you have any questions or feedback.

@gjbex gjbex merged commit a95e12f into master Sep 11, 2024
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @gjbex - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Hard-coded email address found in SLURM script. (link)

Overall Comments:

  • Consider consolidating the CSV generation scripts in pandas and polars directories into a shared utility to avoid duplication.
  • It would be beneficial to add instructions on how to run the benchmarks in a controlled environment to ensure reproducibility.
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🔴 Security: 1 blocking issue
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

import sys


def generate_csv_file(args):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Enhance input validation and error handling

Add more comprehensive input validation for the function arguments, and include error handling for potential issues like file writing errors. This will make the script more robust and user-friendly.

Suggested change
def generate_csv_file(args):
def generate_csv_file(args):
if not isinstance(args, dict):
raise ValueError("args must be a dictionary")
required_keys = ['output_file', 'num_rows', 'columns']
if not all(key in args for key in required_keys):
raise ValueError(f"Missing required arguments: {', '.join(required_keys)}")
try:
with open(args['output_file'], 'w', newline='') as csvfile:
# Rest of the function implementation

Comment on lines +11 to +18
1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the
performance of polars and pandas on large data sets.
1. `create_csv_data.py`: Python script to generate one or more large CSV files
for benchmarking.
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a
cluster.
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance
of polars and pandas on a variety of operations.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Consider adjusting indentation for consistency

The indentation of the new entries doesn't match the existing ones. Consider aligning them for better readability and consistency.

Suggested change
1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the
performance of polars and pandas on large data sets.
1. `create_csv_data.py`: Python script to generate one or more large CSV files
for benchmarking.
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a
cluster.
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance
of polars and pandas on a variety of operations.
1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the
performance of polars and pandas on large data sets.
1. `create_csv_data.py`: Python script to generate one or more large CSV files
for benchmarking.
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a
cluster.
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance
of polars and pandas on a variety of operations.

#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=01:00:00
#SBATCH --mail-user=geertjan.bex@uhasselt.be
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 issue (security): Hard-coded email address found in SLURM script.

Consider using an environment variable or configuration file to manage email addresses to avoid exposing personal information in the codebase.


def generate_csv_file(args):
# Set end-of-line character
if args.file_type == 'Windows':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Simplify conditional into switch-like form [×3] (switch)

Comment on lines +13 to +14
fieldnames = ['timestamp']
fieldnames.extend([f'C{i + 1:d}' for i in range(len(curr_vals))])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Merge extend into list declaration (merge-list-extend)

Suggested change
fieldnames = ['timestamp']
fieldnames.extend([f'C{i + 1:d}' for i in range(len(curr_vals))])
fieldnames = ['timestamp', *[f'C{i + 1:d}' for i in range(len(curr_vals))]]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant