Add polars notebook #19

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

gjbex merged 6 commits into master from development

Sep 3, 2024

Owner

gjbex commented Sep 3, 2024 •

edited by sourcery-ai bot

Loading

Summary by Sourcery

Add a new Polars section to the project, including a README and a Jupyter notebook that explores functional differences between pandas and Polars, replicating the existing pandas notebook.

New Features:

Introduce a new section in the project for Polars, an alternative to pandas, with a focus on performance improvements.

Documentation:

Add documentation for the new Polars section, including a README that explains its purpose and contents.

gjbex and others added 6 commits

September 2, 2024 19:48


          Add polars example

4ffa68b


          Remove hvplot attempt

c038bf9


          New Python version

25a5e95


          Add remaining sections, clean up formatting

88f1973


          Add README

6fd469a


          Merge pull request #18 from gjbex/feature/polars

76475ee

Add polars Jupyter notebook

review-notebook-app bot commented Sep 3, 2024

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

sourcery-ai bot commented Sep 3, 2024 •

edited

Loading

Reviewer's Guide by Sourcery

This pull request adds a new section for Polars, a high-performance alternative to pandas, to the project. It includes updates to the main README.md file and introduces a new README.md file for the Polars section, explaining its purpose and contents.

File-Level Changes

Change	Details	Files
Added Polars section to the main project structure	Included Polars in the list of available libraries/tools Added a brief description of Polars as an illustration	`source-code/README.md`
Created a new README for the Polars section	Introduced Polars as an alternative to pandas with better performance Described the contents of the Polars section, including a Jupyter notebook and data directory Mentioned that the notebook replicates an existing pandas notebook for comparison	`source-code/polars/README.md`

Tips

Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
Continue your discussion with Sourcery by replying directly to review comments.
You can change your review settings at any time by accessing your dashboard:
- Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
- Change the review language;
You can always contact us if you have any questions or feedback.

gjbex merged commit 5a7f3ae into master

sourcery-ai bot reviewed

View reviewed changes

sourcery-ai bot left a comment

Hey @gjbex - I've reviewed your changes - here's some feedback:

Overall Comments:

There's a typo in the main README.md file. 'Kllustrations' should be 'Illustrations' in the Polars entry.

Here's what I looked at during the review

🟡 General issues: 5 issues found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.}

source-code/polars/patient_data.ipynb

Comment on lines +565 to +568

    
                  "for patient in range(1, 10):\n",

                  "    plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n",

                  "             label=f'patient {patient}')\n",

                  "plt.legend();"

sourcery-ai bot Sep 3, 2024

suggestion: Consider using a dynamic range for patient iteration.

The loop iterates over a hardcoded range from 1 to 10. Consider using a dynamic range based on the actual number of patients in the dataset to make the code more flexible.

Suggested change

      
                "for patient in range(1, 10):\n",
          
                "    plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n",
          
                "             label=f'patient {patient}')\n",
          
                "plt.legend();"
          
            num_patients = len([col for col in time_series.columns if col.startswith('temperature_')])
          
            for patient in range(1, num_patients + 1):
          
                plt.plot(time_series['date'], time_series[f'temperature_{patient}'],
          
                         label=f'patient {patient}')
          
            plt.legend()

source-code/polars/patient_data.ipynb

    
                  "for patient in range(1, 10):\n",

                  "    plt.plot(time_series['date'], time_series[f'temperature_{patient}'],\n",

                  "             label=f'patient {patient}')\n",

                  "plt.legend();"

sourcery-ai bot Sep 3, 2024

suggestion (bug_risk): Check for key existence in dictionary access.

When accessing dictionary keys using formatted strings, ensure that the keys exist to avoid potential KeyErrors. Consider using a safer method like dict.get() with a default value.

Suggested change

      
                "plt.legend();"
          
            for patient in range(1, 10):
          
                plt.plot(time_series['date'], time_series.get(f'temperature_{patient}', []),
          
                         label=f'patient {patient}')
          
            plt.legend()

source-code/polars/patient_data.ipynb

    
                  }

                 ],

                 "source": [

                  "time_series.select(pl.col('date'), pl.any_horizontal(pl.all().is_null()).alias('has_null'))"

sourcery-ai bot Sep 3, 2024

suggestion (performance): Optimize null check to specific columns.

Checking for nulls across all columns might be inefficient for large datasets. Consider checking specific columns that are critical for your analysis to improve performance.

time_series.select(
    pl.col('date'),
    pl.any_horizontal(
        pl.col(['critical_column1', 'critical_column2', 'critical_column3']).is_null()
    ).alias('has_null')
)

source-code/polars/patient_data.ipynb

    
                  }

                 ],

                 "source": [

                  "time_series.filter(pl.col('date') == datetime.strptime('2012-10-02 13:00:00', '%Y-%m-%d %H:%M:%S'))"

sourcery-ai bot Sep 3, 2024

suggestion (performance): Optimize date filtering by converting column type.

If date filtering is a frequent operation, consider converting the date column to a datetime type once and using direct comparisons to improve performance.

Suggested change

      
                "time_series.filter(pl.col('date') == datetime.strptime('2012-10-02 13:00:00', '%Y-%m-%d %H:%M:%S'))"
          
            time_series = time_series.with_column(pl.col('date').cast(pl.Datetime))
          
            filtered_series = time_series.filter(pl.col('date') == pl.datetime(2012, 10, 2, 13, 0, 0))

source-code/polars/patient_data.ipynb

    
                 "cell_type": "markdown",

                 "metadata": {},

                 "source": [

                  "From the plots, it seems reasonable to do a linear interpolation for both missing data points."

sourcery-ai bot Sep 3, 2024

suggestion: Specify interpolation method and handle edge cases.

The suggestion to use linear interpolation is vague. Specify the method of interpolation and consider handling edge cases where interpolation might not be appropriate, such as at the boundaries of the dataset.

Suggested change

      
                "From the plots, it seems reasonable to do a linear interpolation for both missing data points."
          
            "From the plots, it seems reasonable to use linear interpolation for the missing data points in the middle of the dataset. For edge cases, such as missing values at the beginning or end of the time series, we should consider using forward or backward fill methods respectively. We'll need to carefully examine these boundary cases before applying the interpolation."

source-code/README.md

               * [`networkx`](networkx): illustration of using the networkx library for graph
                 representation and algorithms.
               * [`pandas`](pandas): illustrations of using pandas and seaborn.
+              * [`polars`](polars): Kllustrations of using polars.

sourcery-ai bot Sep 3, 2024

issue (documentation): Fix typo in 'Kllustrations'

The word 'Kllustrations' should be 'Illustrations'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet