estel/dify

yunqiqiliang 14e1c16cf2 Fix ClickZetta stability and reduce logging noise (#23632 )

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

2025-08-08 22:57:47 +08:00


Clickzetta Vector Database Integration

This module provides integration with Clickzetta Lakehouse as a vector database for Dify.

Features

Vector Storage: Store and retrieve high-dimensional vectors using Clickzetta's native VECTOR type
Vector Search: Efficient similarity search using the HNSW algorithm
Full-Text Search: Leverage Clickzetta's inverted index for powerful text search capabilities
Hybrid Search: Combine vector similarity and full-text search for better results
Multi-language Support: Built-in support for Chinese, English, and Unicode text processing
Scalable: Leverage Clickzetta's distributed architecture for large-scale deployments

Configuration

Required Environment Variables

All seven configuration parameters are required:

# Authentication
CLICKZETTA_USERNAME=your_username
CLICKZETTA_PASSWORD=your_password

# Instance configuration
CLICKZETTA_INSTANCE=your_instance_id
CLICKZETTA_SERVICE=api.clickzetta.com
CLICKZETTA_WORKSPACE=your_workspace
CLICKZETTA_VCLUSTER=your_vcluster
CLICKZETTA_SCHEMA=your_schema
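
Because a single missing variable is enough to break the connection, it can help to check the environment before starting Dify. The following is a minimal sketch: the variable names come from the list above, while the `missing_clickzetta_vars` helper is hypothetical and not part of this module.

```python
import os

# The seven names are taken from the list above.
REQUIRED_VARS = (
    "CLICKZETTA_USERNAME",
    "CLICKZETTA_PASSWORD",
    "CLICKZETTA_INSTANCE",
    "CLICKZETTA_SERVICE",
    "CLICKZETTA_WORKSPACE",
    "CLICKZETTA_VCLUSTER",
    "CLICKZETTA_SCHEMA",
)


def missing_clickzetta_vars(env=None):
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_clickzetta_vars()
    if missing:
        print("Missing Clickzetta settings: " + ", ".join(missing))
    else:
        print("All seven required Clickzetta settings are present.")
```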

Optional Configuration

# Batch processing
CLICKZETTA_BATCH_SIZE=100

# Full-text search configuration
CLICKZETTA_ENABLE_INVERTED_INDEX=true
CLICKZETTA_ANALYZER_TYPE=chinese  # Options: keyword, english, chinese, unicode
CLICKZETTA_ANALYZER_MODE=smart    # Options: max_word, smart

# Vector search configuration
CLICKZETTA_VECTOR_DISTANCE_FUNCTION=cosine_distance  # Options: l2_distance, cosine_distance
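
As a rough illustration of how these options could be read and validated, here is a sketch in Python. The allowed values are the ones listed in the comments above. Treating the sample values as defaults is an assumption, and `OptionalSettings` and `load_optional_settings` are hypothetical names, not the module's own classes.

```python
import os
from dataclasses import dataclass

# Allowed values are taken from the option comments above.
ANALYZER_TYPES = {"keyword", "english", "chinese", "unicode"}
ANALYZER_MODES = {"max_word", "smart"}
DISTANCE_FUNCTIONS = {"l2_distance", "cosine_distance"}


@dataclass(frozen=True)
class OptionalSettings:
    """Hypothetical container; the module's real config class may differ."""
    batch_size: int
    enable_inverted_index: bool
    analyzer_type: str
    analyzer_mode: str
    vector_distance_function: str


def _choice(env, name, default, allowed):
    """Read one option and reject values outside the documented set."""
    value = env.get(name, default).strip().lower()
    if value not in allowed:
        raise ValueError(f"{name}={value!r}, expected one of {sorted(allowed)}")
    return value


def load_optional_settings(env=None):
    """Parse the optional variables, falling back to the sample values (assumed defaults)."""
    env = os.environ if env is None else env
    batch_size = int(env.get("CLICKZETTA_BATCH_SIZE", "100"))
    if batch_size <= 0:
        raise ValueError("CLICKZETTA_BATCH_SIZE must be a positive integer")
    inverted = env.get("CLICKZETTA_ENABLE_INVERTED_INDEX", "true").strip().lower() == "true"
    return OptionalSettings(
        batch_size=batch_size,
        enable_inverted_index=inverted,
        analyzer_type=_choice(env, "CLICKZETTA_ANALYZER_TYPE", "chinese", ANALYZER_TYPES),
        analyzer_mode=_choice(env, "CLICKZETTA_ANALYZER_MODE", "smart", ANALYZER_MODES),
        vector_distance_function=_choice(
            env, "CLICKZETTA_VECTOR_DISTANCE_FUNCTION", "cosine_distance", DISTANCE_FUNCTIONS
        ),
    )
```

Rejecting unknown values early surfaces a typo such as `CLICKZETTA_ANALYZER_MODE=fast` at startup rather than at index creation time.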

Usage

1. Set Clickzetta as the Vector Store

In your Dify configuration, set:

VECTOR_STORE=clickzetta

2. Table Structure

The integration automatically creates tables with the following structure:

CREATE TABLE <collection_name> (
    id STRING NOT NULL,
    content STRING NOT NULL,
    metadata JSON,
    vector VECTOR(FLOAT, <dimension>) NOT NULL,
    PRIMARY KEY (id)
);

-- Vector index for similarity search
CREATE VECTOR INDEX idx_<collection_name>_vec
ON TABLE <schema>.<collection_name>(vector) 
PROPERTIES (
    "distance.function" = "cosine_distance",
    "scalar.type" = "f32"
);

-- Inverted index for full-text search (if enabled)
CREATE INVERTED INDEX idx_<collection_name>_text
ON <schema>.<collection_name>(content)
PROPERTIES (
    "analyzer" = "chinese",
    "mode" = "smart"
);
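
To make the placeholders concrete, the sketch below fills the three templates above from a schema name, a collection name, and a vector dimension. It mirrors the statements exactly as written in this section. The `build_ddl` helper and its identifier check are assumptions for illustration, not the code in `clickzetta_vector.py`.

```python
import re

# Conservative identifier check (an assumption, not Clickzetta's actual naming rules).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_ddl(schema, collection, dimension, distance="cosine_distance",
              analyzer="chinese", mode="smart", inverted_index=True):
    """Return the CREATE statements from the templates above as a list of strings."""
    for ident in (schema, collection):
        if not _IDENTIFIER.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    if not isinstance(dimension, int) or dimension <= 0:
        raise ValueError("dimension must be a positive integer")

    statements = [
        f"CREATE TABLE {collection} (\n"
        "    id STRING NOT NULL,\n"
        "    content STRING NOT NULL,\n"
        "    metadata JSON,\n"
        f"    vector VECTOR(FLOAT, {dimension}) NOT NULL,\n"
        "    PRIMARY KEY (id)\n"
        ")",
        f"CREATE VECTOR INDEX idx_{collection}_vec\n"
        f"ON TABLE {schema}.{collection}(vector)\n"
        f'PROPERTIES ("distance.function" = "{distance}", "scalar.type" = "f32")',
    ]
    if inverted_index:
        # Only emitted when full-text search is enabled.
        statements.append(
            f"CREATE INVERTED INDEX idx_{collection}_text\n"
            f"ON {schema}.{collection}(content)\n"
            f'PROPERTIES ("analyzer" = "{analyzer}", "mode" = "{mode}")'
        )
    return statements


if __name__ == "__main__":
    for statement in build_ddl("my_schema", "my_collection", 768):
        print(statement + ";\n")
```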

Full-Text Search Capabilities

Clickzetta supports advanced full-text search with multiple analyzers:

Analyzer Types

keyword: No tokenization, treats the entire string as a single token
- Best for: Exact matching, IDs, codes
english: Designed for English text
- Features: Recognizes ASCII letters and numbers, converts to lowercase
- Best for: English content
chinese: Chinese text tokenizer
- Features: Recognizes Chinese and English characters, removes punctuation
- Best for: Chinese or mixed Chinese-English content
unicode: Multi-language tokenizer based on Unicode
- Features: Recognizes text boundaries in multiple languages
- Best for: Multi-language content

Analyzer Modes

max_word: Fine-grained tokenization (more tokens)
smart: Intelligent tokenization (balanced)

Full-Text Search Functions

MATCH_ALL(column, query): All terms must be present
MATCH_ANY(column, query): At least one term must be present
MATCH_PHRASE(column, query): Exact phrase matching
MATCH_PHRASE_PREFIX(column, query): Phrase prefix matching
MATCH_REGEXP(column, pattern): Regular expression matching
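
These functions are typically used in a `WHERE` clause. The sketch below builds such a predicate as a string. The function names come from the list above. The `full_text_predicate` helper is hypothetical, and escaping single quotes by doubling them is an assumption about Clickzetta's string literal rules, so prefer parameter binding if your driver supports it.

```python
# Function names are taken from the list above.
FULL_TEXT_FUNCTIONS = {
    "MATCH_ALL",
    "MATCH_ANY",
    "MATCH_PHRASE",
    "MATCH_PHRASE_PREFIX",
    "MATCH_REGEXP",
}


def full_text_predicate(function, column, query):
    """Build a predicate such as MATCH_ALL(content, 'vector search') (hypothetical helper)."""
    name = function.upper()
    if name not in FULL_TEXT_FUNCTIONS:
        raise ValueError(f"unknown full-text function: {function!r}")
    if not column.isidentifier():
        raise ValueError(f"unsafe column name: {column!r}")
    # Assumption: a single quote inside a literal is escaped by doubling it.
    escaped = query.replace("'", "''")
    return f"{name}({column}, '{escaped}')"


if __name__ == "__main__":
    predicate = full_text_predicate("MATCH_ANY", "content", "lakehouse vector")
    print(f"SELECT id, content FROM my_schema.my_collection WHERE {predicate} LIMIT 10;")
```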

Performance Optimization

Vector Search

Adjust exploration factor for accuracy vs speed trade-off:
```
SET cz.vector.index.search.ef=64;
```
Use appropriate distance functions:
- cosine_distance: Best for normalized embeddings (e.g., from language models)
- l2_distance: Best for raw feature vectors
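
The difference between the two distance functions is easiest to see on a small example. The sketch below uses the textbook definitions, assuming Clickzetta's `cosine_distance` and `l2_distance` follow them. Cosine distance ignores vector length, while L2 distance does not, and for unit-length vectors the squared L2 distance equals twice the cosine distance, so both produce the same ranking.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity (textbook definition, assumed to match Clickzetta's)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Same direction, different magnitude: cosine sees no difference, L2 does.
    print(cosine_distance([1.0, 0.0], [2.0, 0.0]), l2_distance([1.0, 0.0], [2.0, 0.0]))

    # Unit vectors: squared L2 distance equals 2 * cosine distance.
    u, v = [1.0, 0.0], [0.0, 1.0]
    print(l2_distance(u, v) ** 2, 2 * cosine_distance(u, v))
```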

Full-Text Search

Choose the right analyzer:
- Use keyword for exact matching
- Use language-specific analyzers for better tokenization
Combine with vector search:
- Pre-filter with full-text search for better performance
- Use hybrid search for improved relevance
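
One generic way to combine the two result lists on the client side is reciprocal rank fusion (RRF), which needs only the rank of each hit and therefore works even when full-text relevance scores are unavailable (see Limitations). The sketch below is an illustration of that technique, not necessarily how this module merges results. The function name and the constant `k=60` are conventional choices, not settings of this integration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # ids ordered by vector distance
    text_hits = ["b", "d"]         # ids returned by a full-text match
    print(reciprocal_rank_fusion([vector_hits, text_hits]))
```

Documents found by both searches accumulate two contributions, so they tend to rise to the top of the merged list.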

Troubleshooting

Connection Issues

Verify all seven required configuration parameters are set
Check network connectivity to Clickzetta service
Ensure the user has proper permissions on the schema

Search Performance

Verify vector index exists:
```
SHOW INDEX FROM <schema>.<table_name>;
```
Check if vector index is being used:
```
EXPLAIN SELECT ... WHERE l2_distance(...) < threshold;
```
Look for vector_index_search_type in the execution plan.

Full-Text Search Not Working

Verify inverted index is created
Check analyzer configuration matches your content language

Use TOKENIZE() function to test tokenization:

SELECT TOKENIZE('your text', map('analyzer', 'chinese', 'mode', 'smart'));

Limitations

ORDER BY and GROUP BY cannot be applied directly to vector columns
Full-text search relevance scores are not provided by Clickzetta
Inverted index creation may fail for very large existing tables; when it does, the integration continues without raising an error
Index naming constraints:
- Index names must be unique within a schema
- Only one vector index can be created per column
- The implementation uses timestamps to ensure unique index names
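
A timestamp suffix is a simple way to satisfy the uniqueness constraint on index names. The sketch below shows the idea. The exact name format used by the implementation is not documented here, so the `idx_<collection>_<kind>_<milliseconds>` pattern and the `unique_index_name` helper are assumptions.

```python
import time


def unique_index_name(collection, kind, now=None):
    """Return an index name with a millisecond timestamp suffix (hypothetical format)."""
    timestamp_ms = int((time.time() if now is None else now) * 1000)
    return f"idx_{collection}_{kind}_{timestamp_ms}"


if __name__ == "__main__":
    print(unique_index_name("my_collection", "vec"))
    print(unique_index_name("my_collection", "text"))
```

Note that a timestamp only makes the name unique. It does not lift the limit of one vector index per column, so an existing vector index still has to be detected before creating another.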
