[NVIDIA GPU] Introduce Monitoring Integration #12581

Closed
wants to merge 0 commits into from

Conversation

strawgate
Contributor

@strawgate strawgate commented Feb 4, 2025

Proposed commit message

Introduce NVIDIA GPU Monitoring Integration

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

How to test this PR locally

Deploy NVIDIA DCGM Exporter on a device with an NVIDIA GPU to get a Prometheus metrics endpoint that you can provide to the integration.

If you have Docker, this only requires:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
curl localhost:9400/metrics

Configure the integration to point at the metrics endpoint on the host running the container and GPU, for example http://nvidiahost:9400/metrics
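
Before configuring the integration, it can help to confirm that the endpoint actually returns GPU samples. The following is a minimal sketch, not part of this PR, that fetches a Prometheus text-format endpoint and parses it into samples. The helper names `parse_metrics` and `fetch_metrics` are hypothetical, and the parser assumes the plain `name{labels} value [timestamp]` line format without decoding escaped label values.

```python
import re
from typing import Dict, List, NamedTuple
from urllib.request import urlopen


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# One sample line: metric name, optional {label="value",...} block,
# the value, and an optional integer timestamp.
_SAMPLE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces; escape sequences are left undecoded.
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text exposition output into a list of samples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore lines this simple parser does not understand.
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(raw_value)))
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> List[Sample]:
    """Fetch a metrics endpoint over HTTP and parse the response body."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics("http://localhost:9400/metrics")` against the container started above and filtering for names that start with `DCGM_` should list the GPU metrics the exporter currently exposes. An empty result suggests the exporter cannot see a GPU or the port mapping is wrong.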

Some metrics are not enabled by default in the container; enabling all metrics requires some extra steps.

Related issues

Fixes #11930

Screenshots

WIP:

[Two screenshots of the work-in-progress dashboards]

@strawgate strawgate requested a review from a team as a code owner February 4, 2025 04:13
@elasticmachine

elasticmachine commented Feb 4, 2025

💔 Build Failed

Failed CI Steps

History


Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@mauri870
Member

@strawgate Did you close this by mistake?

@strawgate
Contributor Author

Will open a new PR.

@strawgate
Contributor Author

Replaced by #12768.

Development

Successfully merging this pull request may close these issues.

[Nvidia GPU] New Integration for Nvidia GPU Monitoring
3 participants