Pulse · Lightning-AI/pytorch-lightning · GitHub

July 4, 2025 – August 4, 2025

Overview

34 Active pull requests

31 Active issues

24 Pull requests merged by 9 people

build(deps): bump Lightning-AI/utilities from 0.14.3 to 0.15.0
#21010 merged Aug 4, 2025
fix ci: progress bar console clearing for latest Rich release
#21016 merged Aug 4, 2025
build(deps): update torchmetrics requirement from <1.8.0,>=0.10.0 to >=0.10.0,<1.9.0 in /requirements
#21006 merged Aug 4, 2025
build(deps): update lightning-utilities requirement from <0.15.0,>=0.11.1 to >=0.11.1,<0.16.0 in /requirements
#21005 merged Aug 4, 2025
build(deps): update docutils requirement from <=0.19,>=0.18.1 to >=0.18.1,<=0.22 in /requirements
#21027 merged Aug 4, 2025
build(deps): update awscli requirement from <1.42.0,>=1.30.0 to >=1.30.0,<1.43.0 in /requirements
#21026 merged Aug 4, 2025
build(deps): bump mypy from 1.17.0 to 1.17.1 in /requirements
#21025 merged Aug 4, 2025
build(deps): bump coverage from 7.9.2 to 7.10.2 in /requirements
#21024 merged Aug 4, 2025
fix: rich progress bar error when resume training
#21000 merged Jul 31, 2025
fix broken links to studios
#21014 merged Jul 28, 2025
fix CI: awscli docutils version conflict
#20997 merged Jul 26, 2025
docs: updating flaking links
#20980 merged Jul 23, 2025
build(deps): update tensorboard requirement from <2.20.0,>=2.9.1 to >=2.9.1,<2.21.0 in /requirements
#20992 merged Jul 22, 2025
build(deps): bump mypy from 1.16.1 to 1.17.0 in /requirements
#20991 merged Jul 22, 2025
build(deps): update fsspec[http] requirement from <2025.6.0,>=2022.5.0 to >=2022.5.0,<2025.8.0 in /requirements
#20990 merged Jul 22, 2025
Allow dataloader_idx_ in log names when add_dataloader_idx=False
#20987 merged Jul 18, 2025
fix: failing markdown link test in ci
#20979 merged Jul 14, 2025
docs: update ref to latest tutorials
#20977 merged Jul 14, 2025
Add support nvcr.io/nvidia/pytorch:25.06-py3
#20971 merged Jul 11, 2025
Model checkpointing save_on_train_epoch_end default behavior documentation
#20931 merged Jul 9, 2025
Fix: Allow trainer to accept CUDAAccelerator instance as accelerator with FSDP strategy
#20964 merged Jul 9, 2025
Add dev env setup guide
#20961 merged Jul 9, 2025
build(deps): bump coverage from 7.9.1 to 7.9.2 in /requirements
#20966 merged Jul 7, 2025
build(deps): update awscli requirement from <1.41.0,>=1.30.0 to >=1.30.0,<1.42.0 in /requirements
#20965 merged Jul 7, 2025

10 Pull requests opened by 6 people

[pre-commit.ci] pre-commit suggestions
#20968 opened Jul 7, 2025
fix: remove extra parameter in accelerator registry decorator
#20975 opened Jul 11, 2025
Fix MLFlowLogger.save_dir Windows file URI handling (Fixes #20972)
#20988 opened Jul 19, 2025
nitpick: add make command to quickly setup the project on `lightning studio`
#20996 opened Jul 23, 2025
docker: simplify the docker name with CUDA
#21001 opened Jul 25, 2025
add/debug Lit CI [wip]
#21002 opened Jul 25, 2025
docs: update mail to developer@lightning.ai
#21003 opened Jul 26, 2025
Allow `training_step` in manual optimization to return general mappings
#21011 opened Jul 28, 2025
Sync dist clarification and consistency
#21012 opened Jul 28, 2025
Fix fabric examples and load_checkpoint hparams ref
#21013 opened Jul 28, 2025

15 Issues closed by 4 people

bugs too many
#20875 closed Aug 4, 2025
Documentation or main page is not loading [not available in your region]
#20989 closed Aug 1, 2025
Rich progress_bar_id is None if restore training state from a step checkpoint
#21015 closed Jul 31, 2025
Missing jsonargparse as dependency
#21018 closed Jul 31, 2025
Rich progress bar error when resume training
#20976 closed Jul 31, 2025
Spend a lot of time to load large ckpt
#21017 closed Jul 31, 2025
on_validation_batch_end is not called when Loss is NaN
#20999 closed Jul 30, 2025
Allow user to use `dataloader_idx` in log name in `LightningModule.log`
#20485 closed Jul 18, 2025
`validation_epoch_end` is still mentioned in the documentation with version >= 2.0.0 while it has been removed from the code
#20559 closed Jul 17, 2025
`load_from_checkpoint` returns `None`
#20607 closed Jul 17, 2025
Lightning is requiring packaging < 25.0
#20772 closed Jul 14, 2025
`ModelCheckpoint`'s argument `save_on_train_epoch_end`'s documentation unclear when value is `None`
#20781 closed Jul 9, 2025
Strategy `fsdp` requires a GPU accelerator, but got CUDAAccelerator
#20957 closed Jul 9, 2025
Recommend dev setup / support uv
#20954 closed Jul 9, 2025
lightning throws an exception on MacOS when the pytorch default device is set
#20696 closed Jul 7, 2025

16 Issues opened by 15 people

DDP Strategy Does Not Automatically Shard Batch Sizes Despite Documentation Claims
#21023 opened Aug 4, 2025
Trainer parameter limit-train-batches was meant to be per-worker
#21022 opened Aug 3, 2025
The difference of Trainer.test with ddp strategy
#21004 opened Jul 27, 2025
Remove an unnecessary TODO in `src/lightning/pytorch/loops/fit_loop.py`
#20998 opened Jul 24, 2025
Changing `on_step` in `self.log` causes `batch_to_device` to change
#20995 opened Jul 23, 2025
Support BatchSizeFinder in DDP
#20994 opened Jul 22, 2025
Accept `TensorDict` (or more generally, dict-like's) as a `training_step` return type
#20993 opened Jul 22, 2025
When the model is in an eval state before calling trainer.fit it should be moved to train state
#20986 opened Jul 18, 2025
Doing full validation on step 0
#20985 opened Jul 17, 2025
uv for CI
#20984 opened Jul 16, 2025
MoE (mixture of experts) support for expert parallel
#20982 opened Jul 15, 2025
Accelerator registry decorator usage fails with TypeError due to incorrect function signature
#20973 opened Jul 11, 2025
MLFlowLogger.save_dir mishandles absolute file: URIs on Windows
#20972 opened Jul 10, 2025
Proper way to use mixed precision with manual optimization
#20970 opened Jul 9, 2025
Recommend uv commands for development scripts
#20969 opened Jul 9, 2025
Improve Fault Tolerance via TorchFT
#20967 opened Jul 7, 2025

72 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

Fix double iteration bug when resumed from a checkpoint.
#20775 commented on Aug 4, 2025 • 6 new comments
Generic weight averaging callback that supports EMA
#20545 commented on Jul 17, 2025 • 1 new comment
Added warmup parameter to early stopping cb
#20778 commented on Jul 19, 2025 • 0 new comments
[Not finished] Allow customized parameter grouping for automatic optimizer configuration.
#20742 commented on Jul 19, 2025 • 0 new comments
fix: `overfit_batches` uses same batch for train and val
#20731 commented on Jul 19, 2025 • 0 new comments
Cpu memory accumulation bug
#20730 commented on Jul 19, 2025 • 0 new comments
docs(LightningModule): update docs for `.training` mode in loops
#20716 commented on Jul 19, 2025 • 0 new comments
FabricModule: wrap forward methods instead of monkeypatch-based redirect
#20711 commented on Aug 4, 2025 • 0 new comments
Added support for flushing Comet experiment data to the Comet after saving a checkpoint.
#20680 commented on Jul 7, 2025 • 0 new comments
feat[logger] update mlflow limit for parameters length log
#20636 commented on Jul 16, 2025 • 0 new comments
fix(mlflow): Enabling multiple callbacks for checkpoint reporting
#20585 commented on Jul 19, 2025 • 0 new comments
updated `ModelCheckpoint` to add the facility of retaining periodic checkpoints
#20547 commented on Jul 10, 2025 • 0 new comments
Add Deepspeed Zero 3 MiCS support (Issues #20378)
#20461 commented on Jul 19, 2025 • 0 new comments
Add `best_k_metrics` parameter to the `ModelCheckpoint`
#20457 commented on Jul 19, 2025 • 0 new comments
Call configure_module before freeze_before_training
#20428 commented on Jul 19, 2025 • 0 new comments
[Backend]: Support device backend registration for a wide range of third-party hardware
#20349 commented on Jul 24, 2025 • 0 new comments
Add compile_fn parameter for Trainer
#20269 commented on Jul 19, 2025 • 0 new comments
Feat: support reusable instance of `ModelCheckpoint`
#20202 commented on Jul 19, 2025 • 0 new comments
Fix `save_last` behavior in absence of validation
#20960 commented on Jul 7, 2025 • 0 new comments
Make asyncio checkpointing work if validate/fit is called more than once
#20952 commented on Aug 4, 2025 • 0 new comments
docs(csv_logs): Clarify CSV and YAML logging distinction and improve examples
#20951 commented on Jul 14, 2025 • 0 new comments
update ModelSummary
#20945 commented on Jul 26, 2025 • 0 new comments
Fix wrong behavior of `DDPStrategy` option with simple GAN training using DDP
#20936 commented on Jul 19, 2025 • 0 new comments
Fix: `no_grad` with AMP bug
#20921 commented on Jul 7, 2025 • 0 new comments
Add `save_on_exception` option to `ModelCheckpoint`
#20916 commented on Aug 2, 2025 • 0 new comments
feat: Default to RichProgressBar and RichModelSummary if rich is avai…
#20896 commented on Aug 4, 2025 • 0 new comments
DOC: Clarify DeviceStatsMonitor logged metrics
#20895 commented on Jul 19, 2025 • 0 new comments
bugfix: add support for `global_ordinal`, `local_ordinal`, `world_size` in xla
#20872 commented on Aug 4, 2025 • 0 new comments
PR: Fix Duplicate Metric Logging in MLFlowLogger to Prevent MLflow Database Errors
#20871 commented on Jul 23, 2025 • 0 new comments
Add documentation warning: Don’t use torch.profiler.profile context manager around Trainer methods
#20864 commented on Jul 19, 2025 • 0 new comments
Fix: Respect `required=False` in `add_lightning_class_args` when `subclass_mode=False`
#20856 commented on Jul 19, 2025 • 0 new comments
Add Callback for Opacus integration
#20853 commented on Jul 19, 2025 • 0 new comments
to_onnx return ONNXProgram
#20811 commented on Aug 4, 2025 • 0 new comments
Fix advanced profiler for python >=3.12
#20809 commented on Aug 4, 2025 • 0 new comments
Torch-Tensorrt Integration with LightningModule
#20808 commented on Jul 23, 2025 • 0 new comments
Support `grad_clip_norm_()` for FSDP
#20784 commented on Jul 19, 2025 • 0 new comments
Restoring Trainer State with Early Stop fails
#13225 commented on Jul 19, 2025 • 0 new comments
ReduceLROnPlateau within configure_optimizers behave abnormally
#20829 commented on Jul 19, 2025 • 0 new comments
Support PyTorch/XLA 2.7
#20852 commented on Jul 19, 2025 • 0 new comments
The progress bar shows wrong length when using multiple dataloaders mixing dataset and iterable dataset
#20695 commented on Jul 19, 2025 • 0 new comments
Cannot call self.log in evaluation_hooks after using trainer.predict, even if using a new trainer object.
#19101 commented on Jul 19, 2025 • 0 new comments
Error when learning on tpu
#20891 commented on Jul 19, 2025 • 0 new comments
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
#20244 commented on Jul 19, 2025 • 0 new comments
Warnings when learning on tpu
#20890 commented on Jul 19, 2025 • 0 new comments
Weird bug when setting `val_check_interval` dynamically in `setup()`
#20894 commented on Jul 19, 2025 • 0 new comments
Tqdm print multi lines with refresh
#20909 commented on Jul 19, 2025 • 0 new comments
Logging in `on_test_epoch_end` with multiple dataloaders
#20885 commented on Jul 19, 2025 • 0 new comments
Ignore Keyword Arguments Outside of Callback Signature During `Fabric.call`
#20915 commented on Jul 19, 2025 • 0 new comments
Global step reset when restoring checkpoints with trainer.validate
#17127 commented on Jul 15, 2025 • 0 new comments
`ModelCheckpoint` not saving best model
#20657 commented on Jul 13, 2025 • 0 new comments
MLFlow logger with remote tracking fails with CLI
#16310 commented on Jul 11, 2025 • 0 new comments
stateful dataloaders do not load their state_dict if self.trainer.estimated_stepping_batches called beforehand
#20550 commented on Jul 9, 2025 • 0 new comments
Inconsistency in loading from checkpoint in LightningCLI
#20801 commented on Jul 9, 2025 • 0 new comments
Metrics get mapped twice to the same epoch in MLflow logger
#20902 commented on Jul 7, 2025 • 0 new comments
docs: fixed the `init_module` and deepspeed
#20175 commented on Jul 19, 2025 • 0 new comments
Call `configure_model` from LightningCLI
#19111 commented on Jul 19, 2025 • 0 new comments
deprecation: Is `frequency` key necessary in `lr_scheduler_config`?
#20714 commented on Aug 4, 2025 • 0 new comments
PyTorchProfiler: not showing CPU memory used even with `profile_memory=True`
#20339 commented on Aug 1, 2025 • 0 new comments
Validation takes place every N time
#13324 commented on Aug 1, 2025 • 0 new comments
LearningRateMonitor broken on MPS backend with Apple silicon
#20250 commented on Jul 31, 2025 • 0 new comments
Resuming should allow to differentiate what to resume (steps/opti/weights)
#5339 commented on Jul 29, 2025 • 0 new comments
Model diverges or struggles to converge with complex-valued tensors in DDP
#20480 commented on Jul 28, 2025 • 0 new comments
Gradient accumulation calculation may be incorrect
#20350 commented on Jul 28, 2025 • 0 new comments
Error using wandb when learning on tpu
#20880 commented on Jul 27, 2025 • 0 new comments
Light / dark mode for documentation
#20396 commented on Jul 23, 2025 • 0 new comments
Parameters and Gradient is not logged by WandB under FSDP strategy
#17512 commented on Jul 21, 2025 • 0 new comments
deepspeed strategy can't save checkpoint, TypeError: cannot pickle `torch._C._distributed_c10d.ProcessGroup` object
#17369 commented on Jul 19, 2025 • 0 new comments
diff-svc(winerror3 when the training starts)
#20849 commented on Jul 19, 2025 • 0 new comments
Fabric FSDP with bitsandbytes plugin is not supported
#20855 commented on Jul 19, 2025 • 0 new comments
`add_lightning_class_args` `required` argument ignored if not using subclass mode
#20851 commented on Jul 19, 2025 • 0 new comments
Mlflow logging LR duplicate key issue with PostgreSQL DB #190
#20865 commented on Jul 19, 2025 • 0 new comments
lightning.fabric.utilities.exceptions.MisconfigurationException: No supported gpu backend found! Maybe latest gpu compatibility issue..?
#20626 commented on Jul 19, 2025 • 0 new comments