
kerberos auth for proxy #660

Open

wants to merge 1 commit into base: main
136 changes: 136 additions & 0 deletions CLAUDE.md
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repository Overview

This is the official Python client for Databricks SQL. It implements PEP 249 (DB API 2.0) and uses Apache Thrift for communication with Databricks clusters/SQL warehouses.

## Essential Development Commands

```bash
# Install dependencies
poetry install

# Install with PyArrow support (recommended)
poetry install --all-extras

# Run unit tests
poetry run python -m pytest tests/unit

# Run specific test
poetry run python -m pytest tests/unit/test_client.py::ClientTestSuite::test_method_name

# Code formatting (required before commits)
poetry run black src

# Type checking
poetry run mypy --install-types --non-interactive src

# Check formatting without changing files
poetry run black src --check
```

## High-Level Architecture

### Core Components

1. **Client Layer** (`src/databricks/sql/client.py`)
- Main entry point implementing DB API 2.0
- Handles connections, cursors, and query execution
- Key classes: `Connection`, `Cursor`

2. **Backend Layer** (`src/databricks/sql/backend/`)
- Thrift-based communication with Databricks
- Handles protocol-level operations
- Key files: `thrift_backend.py`, `databricks_client.py`
- SEA (Statement Execution API) support in `experimental/backend/sea_backend.py`

3. **Authentication** (`src/databricks/sql/auth/`)
- Multiple auth methods: OAuth U2M/M2M, PAT, custom providers
- Authentication flow abstraction
- OAuth persistence support for token caching

4. **Data Transfer** (`src/databricks/sql/cloudfetch/`)
- Cloud fetch for large results
- Arrow format support for efficiency
- Handles data pagination and streaming
- Result set management in `result_set.py`

5. **Parameters** (`src/databricks/sql/parameters/`)
- Native parameter support (v3.0.0+) - server-side parameterization
- Inline parameters (legacy) - client-side interpolation
- SQL injection prevention
- Type mapping and conversion

6. **Telemetry** (`src/databricks/sql/telemetry/`)
- Usage metrics and performance monitoring
- Configurable batch processing and time-based flushing
- Server-side flag integration

### Key Design Patterns

- **Result Sets**: Uses Arrow format by default for efficient data transfer
- **Error Handling**: Comprehensive retry logic with exponential backoff
- **Resource Management**: Context managers for proper cleanup
- **Type System**: Strong typing with MyPy throughout

## Testing Strategy

### Unit Tests (No Databricks account needed)
```bash
poetry run python -m pytest tests/unit
```

### E2E Tests (Requires Databricks account)
1. Set environment variables or create `test.env` file:
   ```bash
   export DATABRICKS_SERVER_HOSTNAME="****"
   export DATABRICKS_HTTP_PATH="/sql/1.0/endpoints/****"
   export DATABRICKS_TOKEN="dapi****"
   ```
2. Run: `poetry run python -m pytest tests/e2e`
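The variables from step 1 can be collected up front so that a missing one fails fast instead of surfacing later as a confusing connection error. A minimal sketch (the `load_e2e_config` helper is hypothetical, not part of this repo):

```python
import os

REQUIRED_VARS = ("DATABRICKS_SERVER_HOSTNAME", "DATABRICKS_HTTP_PATH", "DATABRICKS_TOKEN")

def load_e2e_config(env=os.environ):
    """Return the E2E connection settings, raising if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing E2E settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```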

Test organization:
- `tests/unit/` - Fast, isolated unit tests
- `tests/e2e/` - Integration tests against real Databricks
- Test files follow `test_*.py` naming convention
- Test suites: core, large queries, staging ingestion, retry logic

## Important Development Notes

1. **Dependency Management**: Always use Poetry, never pip directly
2. **Code Style**: Black formatter with a 100-character line limit (PEP 8, with that one exception)
3. **Type Annotations**: Required for all new code
4. **Thrift Files**: Generated code in `thrift_api/` - do not edit manually
5. **Parameter Security**: Always use native parameters, never string interpolation
6. **Arrow Support**: Optional but highly recommended for performance
7. **Python Support**: 3.8+ (up to 3.13)
8. **DCO**: Sign commits with Developer Certificate of Origin
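The parameter-security rule above (native parameters, never string interpolation) comes down to sending the SQL text and the values separately, so the server binds the values. A hedged sketch — the `:name` marker syntax follows the connector's v3 native paramstyle and is worth confirming against the docs for your version; nothing is executed here:

```python
# Safe: query text and values travel separately; the server binds the values.
query = "SELECT * FROM orders WHERE status = :status AND amount > :min_amount"
params = {"status": "shipped", "min_amount": 100}
# With a live cursor this would be: cursor.execute(query, params)

# Unsafe legacy pattern (never do this): values interpolated client-side.
# cursor.execute(f"SELECT * FROM orders WHERE status = '{user_input}'")
```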

## Common Development Tasks

### Adding a New Feature
1. Implement in appropriate module under `src/databricks/sql/`
2. Add unit tests in `tests/unit/`
3. Add integration tests in `tests/e2e/` if needed
4. Update type hints and ensure MyPy passes
5. Run Black formatter before committing

### Debugging Connection Issues
- Check auth configuration in `auth/` modules
- Review retry logic in `src/databricks/sql/utils.py`
- Enable debug logging for detailed trace
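Enabling debug logging uses the standard `logging` module; the connector's loggers live under the `databricks.sql` namespace (an assumption based on the package path, worth verifying in your version):

```python
import logging

# Keep third-party libraries at WARNING, but trace the connector itself.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
```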

### Working with Thrift
- Protocol definitions in `src/databricks/sql/thrift_api/`
- Backend implementation in `backend/thrift_backend.py`
- Don't modify generated Thrift files directly

### Running Examples
Example scripts are in `examples/` directory:
- Basic query execution examples
- OAuth authentication patterns
- Parameter usage (native vs inline)
- Staging ingestion operations
- Custom credential providers
175 changes: 175 additions & 0 deletions docs/proxy_configuration.md
# Proxy Configuration Guide

This guide explains how to configure the Databricks SQL Connector for Python to work with HTTP/HTTPS proxies, including support for Kerberos authentication.

## Table of Contents
- [Basic Proxy Configuration](#basic-proxy-configuration)
- [Proxy with Basic Authentication](#proxy-with-basic-authentication)
- [Proxy with Kerberos Authentication](#proxy-with-kerberos-authentication)
- [Troubleshooting](#troubleshooting)

## Basic Proxy Configuration

The connector automatically detects proxy settings from environment variables:

```bash
# For HTTPS connections (most common)
export HTTPS_PROXY=http://proxy.example.com:8080

# For HTTP connections
export HTTP_PROXY=http://proxy.example.com:8080

# Hosts to bypass proxy
export NO_PROXY=localhost,127.0.0.1,.internal.company.com
```

Then connect normally:

```python
from databricks import sql

connection = sql.connect(
    server_hostname="your-workspace.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse",
    access_token="your-token"
)
```
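Since the connector picks its proxy up from the environment, you can check what Python's standard library sees before connecting. `urllib.request.getproxies_environment()` reads exactly the `*_PROXY` variables above (`NO_PROXY` is consulted separately, via `urllib.request.proxy_bypass_environment`):

```python
from urllib.request import getproxies_environment

# Prints the scheme -> proxy-URL mapping derived from HTTP_PROXY / HTTPS_PROXY.
# e.g. {'https': 'http://proxy.example.com:8080'} when HTTPS_PROXY is set.
print(getproxies_environment())
```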

## Proxy with Basic Authentication

For proxies requiring username/password authentication, include credentials in the proxy URL:

```bash
export HTTPS_PROXY=http://username:password@proxy.example.com:8080
```
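If the username or password contains reserved URL characters (`@`, `:`, `/`, `#`), they must be percent-encoded or the proxy URL will parse incorrectly. A small stdlib helper (`build_proxy_url` is illustrative, not a connector API):

```python
from urllib.parse import quote

def build_proxy_url(host, port, username=None, password=None):
    """Assemble an http proxy URL, percent-encoding any credentials."""
    if username is None:
        return f"http://{host}:{port}"
    cred = quote(username, safe="")
    if password is not None:
        cred += ":" + quote(password, safe="")
    return f"http://{cred}@{host}:{port}"

# A password like "p@ss:word" becomes p%40ss%3Aword inside the URL.
```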

## Proxy with Kerberos Authentication

For enterprise environments using Kerberos authentication on proxies:

### Prerequisites

1. Install Kerberos dependencies (quoted so the extras brackets survive shell globbing):
   ```bash
   pip install "databricks-sql-connector[kerberos]"
   ```

2. Obtain a valid Kerberos ticket:
   ```bash
   kinit user@EXAMPLE.COM
   ```

3. Set proxy environment variables (without credentials):
   ```bash
   export HTTPS_PROXY=http://proxy.example.com:8080
   ```

### Connection with Kerberos Proxy

```python
from databricks import sql

connection = sql.connect(
    server_hostname="your-workspace.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse",
    access_token="your-databricks-token",

    # Enable Kerberos proxy authentication
    _proxy_auth_type="kerberos",

    # Optional Kerberos settings
    _proxy_kerberos_service_name="HTTP",           # Default: "HTTP"
    _proxy_kerberos_principal="user@EXAMPLE.COM",  # Optional: uses default if not set
    _proxy_kerberos_delegate=False,                # Whether to delegate credentials (default: False)
    _proxy_kerberos_mutual_auth="REQUIRED"         # Options: REQUIRED, OPTIONAL, DISABLED
)
```

### Kerberos Configuration Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `_proxy_auth_type` | None | Set to `"kerberos"` to enable Kerberos proxy auth |
| `_proxy_kerberos_service_name` | `"HTTP"` | Kerberos service name for the proxy |
| `_proxy_kerberos_principal` | None | Specific principal to use (uses default if not set) |
| `_proxy_kerberos_delegate` | `False` | Whether to delegate credentials to the proxy |
| `_proxy_kerberos_mutual_auth` | `"REQUIRED"` | Mutual authentication requirement level |

### Example: Custom Kerberos Settings

```python
# Using a specific service principal with delegation
connection = sql.connect(
    server_hostname="your-workspace.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse",
    access_token="your-token",

    _proxy_auth_type="kerberos",
    _proxy_kerberos_service_name="HTTP",
    _proxy_kerberos_principal="dbuser@CORP.EXAMPLE.COM",
    _proxy_kerberos_delegate=True,          # Allow credential delegation
    _proxy_kerberos_mutual_auth="OPTIONAL"  # Less strict verification
)
```

## Troubleshooting

### Kerberos Authentication Issues

1. **No Kerberos ticket**:
   ```bash
   # Check if you have a valid ticket
   klist

   # If not, obtain one
   kinit user@EXAMPLE.COM
   ```

2. **Wrong service principal**:
- Check with your IT team for the correct proxy service principal name
- It's typically `HTTP@proxy.example.com` but may vary

3. **Import errors**:
   ```
   ImportError: Kerberos proxy authentication requires 'pykerberos'
   ```
   Solution: Install with `pip install "databricks-sql-connector[kerberos]"`

### Proxy Connection Issues

1. **Enable debug logging**:
   ```python
   import logging
   logging.basicConfig(level=logging.DEBUG)
   ```

2. **Test proxy connectivity**:
   ```bash
   # Test if proxy is reachable
   curl -x http://proxy.example.com:8080 https://www.databricks.com
   ```

3. **Verify environment variables**:
   ```python
   import os
   print(f"HTTPS_PROXY: {os.environ.get('HTTPS_PROXY')}")
   print(f"NO_PROXY: {os.environ.get('NO_PROXY')}")
   ```

### Platform-Specific Notes

- **Linux/Mac**: Uses `pykerberos` library
- **Windows**: Uses `winkerberos` library (automatically selected)
- **Docker/Containers**: Ensure Kerberos configuration files are mounted
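The per-platform library selection above can be sketched as a conditional import. This mirrors the common pattern for Kerberos bindings; the helper below is illustrative, not the connector's actual code:

```python
import importlib
import sys

def load_kerberos_binding():
    """Return the platform's Kerberos module, or None if it isn't installed."""
    # winkerberos wraps Windows SSPI; pykerberos covers Linux/macOS GSSAPI.
    name = "winkerberos" if sys.platform == "win32" else "kerberos"
    try:
        return importlib.import_module(name)
    except ImportError:
        return None
```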

## Security Considerations

1. **Avoid hardcoding credentials** - Use environment variables or secure credential stores
2. **Use HTTPS connections** - Even through proxies, maintain encrypted connections to Databricks
3. **Credential delegation** - Only enable `_proxy_kerberos_delegate=True` if required by your proxy
4. **Mutual authentication** - Keep `_proxy_kerberos_mutual_auth="REQUIRED"` for maximum security

## See Also

- [Kerberos Proxy Example](../examples/kerberos_proxy_auth.py)
- [Databricks SQL Connector Documentation](https://docs.databricks.com/dev-tools/python-sql-connector.html)