Add polars #20

gjbex · 2024-09-11T08:52:31Z

As polars is becoming popular as an alternative to pandas, it is useful to understand its strengths and weaknesses.

Feature comparison on patient data example.
Performance comparison

Summary by Sourcery

Add new resources for comparing Polars and Pandas, including performance and functional differences. Introduce scripts for generating large datasets and running benchmarks, and update documentation to reflect these additions.

New Features:

Introduce a Jupyter notebook for performance comparison between Polars and Pandas on large datasets.
Add a Python script to generate large CSV files for benchmarking purposes.
Include a Slurm script to facilitate running the CSV generation script on a cluster.

Enhancements:

Add a Jupyter notebook to explore functional differences between Polars and Pandas using patient data.

Documentation:

Update README files to include information about new notebooks and scripts related to Polars and Pandas.

review-notebook-app · 2024-09-11T08:52:35Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

sourcery-ai · 2024-09-11T08:52:37Z

Reviewer's Guide by Sourcery

This pull request adds support for the Polars library, introducing new files for performance benchmarking and comparison with Pandas. The changes include new Jupyter notebooks, Python scripts for data generation, and updates to existing README files.

File-Level Changes

Change	Details	Files
Added Polars performance benchmarking notebooks and scripts	Created polars_performance.ipynb to compare Polars and Pandas performance; added polars_large_data_benchmark.ipynb for benchmarking on large datasets; implemented create_csv_data.py script to generate large CSV files for benchmarking; added create_csv_data.slurm script for running data generation on a cluster	`source-code/polars/README.md`, `source-code/polars/polars_performance.ipynb`, `source-code/polars/polars_large_data_benchmark.ipynb`, `source-code/polars/create_csv_data.py`
Updated README files with new content descriptions	Added descriptions for new Polars-related files in polars/README.md; updated pandas/README.md with information about generate_csv_files.py	`source-code/polars/README.md`, `source-code/pandas/README.md`
Added CSV file generation script for Pandas	Implemented generate_csv_files.py script for creating CSV files with various options; added support for different column types, file formats, and customization options	`source-code/pandas/generate_csv_files.py`

Tips

Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
Continue your discussion with Sourcery by replying directly to review comments.
You can change your review settings at any time by accessing your dashboard:
- Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
- Change the review language;
You can always contact us if you have any questions or feedback.

sourcery-ai

Hey @gjbex - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

Hard-coded email address found in SLURM script.

Overall Comments:

Consider consolidating the CSV generation scripts in pandas and polars directories into a shared utility to avoid duplication.
It would be beneficial to add instructions on how to run the benchmarks in a controlled environment to ensure reproducibility.
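
As a rough illustration of what a more reproducible benchmark run could record, the sketch below repeats a timing several times with the standard library's `timeit.repeat` and stores the interpreter version and platform next to the results. The `benchmark` helper and the keys of the returned dictionary are hypothetical and not part of this pull request; the notebooks may measure things differently.

```python
import platform
import statistics
import sys
import timeit


def benchmark(func, repeat=5, number=1):
    """Time `func` several times and return timings with environment details.

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    # timeit.repeat returns one total time per repetition,
    # each covering `number` calls of func.
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "repeat": repeat,
        "number": number,
        "best_s": min(timings) / number,
        "median_s": statistics.median(timings) / number,
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would time a pandas or polars operation.
    print(benchmark(lambda: sum(range(100_000)), repeat=5, number=10))
```

Reporting the best and median time over several repetitions, together with the software versions, makes it easier to compare runs made on different machines.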

Here's what I looked at during the review

🟡 General issues: 1 issue found
🔴 Security: 1 blocking issue
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

sourcery-ai · 2024-09-11T08:54:15Z

source-code/pandas/generate_csv_files.py

+import sys
+
+
+def generate_csv_file(args):


suggestion: Enhance input validation and error handling

Add more comprehensive input validation for the function arguments, and include error handling for potential issues like file writing errors. This will make the script more robust and user-friendly.

Suggested change

-def generate_csv_file(args):
+def generate_csv_file(args):
+    if not isinstance(args, dict):
+        raise ValueError("args must be a dictionary")
+    required_keys = ['output_file', 'num_rows', 'columns']
+    if not all(key in args for key in required_keys):
+        raise ValueError(f"Missing required arguments: {', '.join(required_keys)}")
+    try:
+        with open(args['output_file'], 'w', newline='') as csvfile:
+            # Rest of the function implementation
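
Note that another diff excerpt in this review reads `args.file_type`, which suggests that `args` is an attribute-style object such as an `argparse.Namespace` rather than a dictionary, so the `isinstance(args, dict)` check above would probably reject valid input. A minimal sketch of the same idea for attribute access follows; the attribute names `output_file`, `num_rows` and `columns` are copied from the suggestion and may not match the real script, and `validate_args` and `write_rows` are hypothetical helpers.

```python
import argparse
import csv

# Attribute names copied from the suggestion above; the actual script
# may use different option names (assumption).
REQUIRED_ATTRS = ("output_file", "num_rows", "columns")


def validate_args(args):
    """Raise ValueError when a required attribute is missing or invalid."""
    missing = [name for name in REQUIRED_ATTRS
               if getattr(args, name, None) is None]
    if missing:
        raise ValueError(f"Missing required arguments: {', '.join(missing)}")
    if not isinstance(args.num_rows, int) or args.num_rows < 0:
        raise ValueError("num_rows must be a non-negative integer")


def write_rows(args, rows):
    """Validate args, then write a header and rows to args.output_file."""
    validate_args(args)
    try:
        with open(args.output_file, "w", newline="") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(args.columns)
            writer.writerows(rows)
    except OSError as error:
        # Report file system problems without a traceback.
        raise SystemExit(f"Cannot write {args.output_file}: {error}")


if __name__ == "__main__":
    demo = argparse.Namespace(output_file="demo.csv", num_rows=2,
                              columns=["a", "b"])
    write_rows(demo, [[1, 2], [3, 4]])
```

Unlike the suggestion, the error message here lists only the arguments that are actually missing.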

sourcery-ai · 2024-09-11T08:54:15Z

source-code/polars/README.md

+1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the
+   performance of polars and pandas on large data sets.
+1. `create_csv_data.py`: Python script to generate one or more large CSV files
+   for benchmarking.
+1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a
+   cluster.
+1. `polars_performance.ipynb`: Jupyter notebook that compares the performance
+   of polars and pandas on a variety of operations.


suggestion (documentation): Consider adjusting indentation for consistency

The indentation of the new entries doesn't match the existing ones. Consider aligning them for better readability and consistency.

Suggested change

1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the
performance of polars and pandas on large data sets.
1. `create_csv_data.py`: Python script to generate one or more large CSV files
for benchmarking.
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a
cluster.
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance
of polars and pandas on a variety of operations.

1. `polars_large_data_benchmark.ipynb`: Jupyter notebook that compares the
performance of polars and pandas on large data sets.
1. `create_csv_data.py`: Python script to generate one or more large CSV files
for benchmarking.
1. `create_csv_data.slurm`: Slurm script to run `create_csv_data.py` on a
cluster.
1. `polars_performance.ipynb`: Jupyter notebook that compares the performance
of polars and pandas on a variety of operations.

sourcery-ai · 2024-09-11T08:54:15Z

source-code/polars/create_csv_data.slurm

+#SBATCH --cpus-per-task=1
+#SBATCH --mem=2G
+#SBATCH --time=01:00:00
+#SBATCH --mail-user=geertjan.bex@uhasselt.be


🚨 issue (security): Hard-coded email address found in SLURM script.

Consider using an environment variable or configuration file to manage email addresses to avoid exposing personal information in the codebase.
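
One possible way to follow this advice is to drop the `--mail-user` directive from the job script and pass the address on the `sbatch` command line instead, since options given on the command line take precedence over `#SBATCH` directives in the script. The sketch below only builds the command; the environment variable name `JOB_MAIL_USER` and the helper `build_sbatch_command` are hypothetical and not part of this pull request.

```python
import os


def build_sbatch_command(script, environ=None):
    """Return an sbatch command list, adding --mail-user only when configured.

    JOB_MAIL_USER is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    command = ["sbatch"]
    mail_user = environ.get("JOB_MAIL_USER")
    if mail_user:
        command.append(f"--mail-user={mail_user}")
    command.append(script)
    return command


if __name__ == "__main__":
    # Print the command rather than running it, so this works without Slurm.
    print(" ".join(build_sbatch_command("create_csv_data.slurm")))
```

On a cluster the resulting list could be handed to `subprocess.run`; when the variable is unset, the job is submitted without a notification address.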

sourcery-ai · 2024-09-11T08:54:15Z

source-code/pandas/generate_csv_files.py

+
+def generate_csv_file(args):
+    # Set end-of-line character
+    if args.file_type == 'Windows':


issue (code-quality): Simplify conditional into switch-like form [×3] (switch)
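
The rewritten conditional is not shown in this comment. As one possible alternative to a chain of comparisons on `args.file_type`, a dictionary lookup keeps the mapping in a single place. Only the `'Windows'` value appears in the diff above; the `'Unix'` key, the line endings chosen and the `end_of_line` helper are assumptions made for this sketch, and this is not necessarily the change Sourcery had in mind.

```python
# Mapping from file type to end-of-line sequence. Only 'Windows' is
# visible in the diff; the 'Unix' entry is an assumption.
EOL_BY_FILE_TYPE = {
    "Windows": "\r\n",
    "Unix": "\n",
}


def end_of_line(file_type):
    """Return the end-of-line sequence for a file type, or raise ValueError."""
    try:
        return EOL_BY_FILE_TYPE[file_type]
    except KeyError:
        raise ValueError(f"unknown file type: {file_type!r}") from None


if __name__ == "__main__":
    print(repr(end_of_line("Windows")))
```

Adding a new file type then means adding one dictionary entry instead of another branch.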

sourcery-ai · 2024-09-11T08:54:16Z

source-code/polars/create_csv_data.py

+    fieldnames = ['timestamp']
+    fieldnames.extend([f'C{i + 1:d}' for i in range(len(curr_vals))])


suggestion (code-quality): Merge extend into list declaration (merge-list-extend)

Suggested change

-    fieldnames = ['timestamp']
-    fieldnames.extend([f'C{i + 1:d}' for i in range(len(curr_vals))])
+    fieldnames = ['timestamp', *[f'C{i + 1:d}' for i in range(len(curr_vals))]]
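
A quick check that the two forms build the same list, using a hypothetical three-element `curr_vals` (the real values come from the script and are not shown here):

```python
curr_vals = [0.5, 1.5, 2.5]  # hypothetical sample values

# Original form: create the list, then extend it.
fieldnames_extend = ['timestamp']
fieldnames_extend.extend([f'C{i + 1:d}' for i in range(len(curr_vals))])

# Suggested form: unpack the generated names into the list literal.
fieldnames_merged = ['timestamp', *[f'C{i + 1:d}' for i in range(len(curr_vals))]]

print(fieldnames_merged)  # ['timestamp', 'C1', 'C2', 'C3']
```

Both variants produce the same column names, so the choice is a matter of style.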

gjbex added 10 commits September 8, 2024 07:02

Add script to generate large CSV files for performance tests

73ce81f

Modernize code

acf4f89

Add CSV test files and generator

ed5070f

Add pandas versus polars benchmark

434ba32

Add generation script for large CSV files

b7f205b

Add cluster

3b68093

Add Slurm job output files

9b9ae90

Add description and conclusions

a972a99

Rename to reflect purpose of notebook

8493226

Add polars versus pandas performance benchmarks

2917a29

gjbex merged commit a95e12f into master Sep 11, 2024
