Clickzetta Vector Database Integration
This module provides integration with Clickzetta Lakehouse as a vector database for Dify.
Features
- Vector Storage: Store and retrieve high-dimensional vectors using Clickzetta's native VECTOR type
- Vector Search: Efficient similarity search using the HNSW algorithm
- Full-Text Search: Leverage Clickzetta's inverted index for powerful text search capabilities
- Hybrid Search: Combine vector similarity and full-text search for better results
- Multi-language Support: Built-in support for Chinese, English, and Unicode text processing
- Scalable: Leverage Clickzetta's distributed architecture for large-scale deployments
Configuration
Required Environment Variables
All seven configuration parameters are required:
# Authentication
CLICKZETTA_USERNAME=your_username
CLICKZETTA_PASSWORD=your_password
# Instance configuration
CLICKZETTA_INSTANCE=your_instance_id
CLICKZETTA_SERVICE=api.clickzetta.com
CLICKZETTA_WORKSPACE=your_workspace
CLICKZETTA_VCLUSTER=your_vcluster
CLICKZETTA_SCHEMA=your_schema
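A quick pre-flight check can catch a missing setting before Dify tries to connect. The following is a minimal sketch, not part of the integration itself: the helper name `missing_clickzetta_vars` is hypothetical, and it only checks that each of the seven variables listed above is present and non-empty.

```python
import os

# The seven variable names are copied from the list above.
REQUIRED_VARS = (
    "CLICKZETTA_USERNAME",
    "CLICKZETTA_PASSWORD",
    "CLICKZETTA_INSTANCE",
    "CLICKZETTA_SERVICE",
    "CLICKZETTA_WORKSPACE",
    "CLICKZETTA_VCLUSTER",
    "CLICKZETTA_SCHEMA",
)


def missing_clickzetta_vars(env=None):
    """Return the required variable names that are unset or blank in env.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_clickzetta_vars()` with no arguments inspects the current process environment; an empty list means all seven settings are present.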
Optional Configuration
# Batch processing
CLICKZETTA_BATCH_SIZE=100
# Full-text search configuration
CLICKZETTA_ENABLE_INVERTED_INDEX=true
CLICKZETTA_ANALYZER_TYPE=chinese # Options: keyword, english, chinese, unicode
CLICKZETTA_ANALYZER_MODE=smart # Options: max_word, smart
# Vector search configuration
CLICKZETTA_VECTOR_DISTANCE_FUNCTION=cosine_distance # Options: l2_distance, cosine_distance
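The optional settings accept only the values listed in the comments above, so it can help to validate them early. This is a sketch under two assumptions: the function name `load_optional_settings` is hypothetical, and the example values shown above are treated as defaults, which this document does not state explicitly.

```python
import os

# Allowed values are taken from the "Options:" comments above.
ANALYZER_TYPES = {"keyword", "english", "chinese", "unicode"}
ANALYZER_MODES = {"max_word", "smart"}
DISTANCE_FUNCTIONS = {"l2_distance", "cosine_distance"}


def load_optional_settings(env=None):
    """Read and validate the optional settings (hypothetical helper).

    Assumption: the example values above double as defaults.
    """
    env = os.environ if env is None else env
    inverted = env.get("CLICKZETTA_ENABLE_INVERTED_INDEX", "true")
    settings = {
        "batch_size": int(env.get("CLICKZETTA_BATCH_SIZE", "100")),
        "enable_inverted_index": inverted.strip().lower() == "true",
        "analyzer_type": env.get("CLICKZETTA_ANALYZER_TYPE", "chinese"),
        "analyzer_mode": env.get("CLICKZETTA_ANALYZER_MODE", "smart"),
        "distance_function": env.get(
            "CLICKZETTA_VECTOR_DISTANCE_FUNCTION", "cosine_distance"
        ),
    }
    if settings["batch_size"] <= 0:
        raise ValueError("CLICKZETTA_BATCH_SIZE must be a positive integer")
    checks = (
        ("analyzer_type", ANALYZER_TYPES),
        ("analyzer_mode", ANALYZER_MODES),
        ("distance_function", DISTANCE_FUNCTIONS),
    )
    for key, allowed in checks:
        if settings[key] not in allowed:
            raise ValueError(
                f"{key}={settings[key]!r} is not one of {sorted(allowed)}"
            )
    return settings
```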
Usage
1. Set Clickzetta as the Vector Store
In your Dify configuration, set:
VECTOR_STORE=clickzetta
2. Table Structure
The integration automatically creates tables in Clickzetta with the following structure:
CREATE TABLE <collection_name> (
    id STRING NOT NULL,
    content STRING NOT NULL,
    metadata JSON,
    vector VECTOR(FLOAT, <dimension>) NOT NULL,
    PRIMARY KEY (id)
);
-- Vector index for similarity search
CREATE VECTOR INDEX idx_<collection_name>_vec
ON TABLE <schema>.<collection_name>(vector)
PROPERTIES (
    "distance.function" = "cosine_distance",
    "scalar.type" = "f32"
);
-- Inverted index for full-text search (if enabled)
CREATE INVERTED INDEX idx_<collection_name>_text
ON <schema>.<collection_name>(content)
PROPERTIES (
    "analyzer" = "chinese",
    "mode" = "smart"
);
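To see how the placeholders in the table template are filled in, the sketch below renders the CREATE TABLE statement for a given collection name and embedding dimension. It is illustrative only: `render_table_ddl` is a hypothetical helper, the identifier check is a conservative assumption rather than Clickzetta's actual naming rule, and the integration's own code may build the statement differently.

```python
import re

# Conservative identifier pattern (an assumption, not Clickzetta's rule).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def render_table_ddl(collection_name: str, dimension: int) -> str:
    """Fill the table template from this section (hypothetical helper)."""
    if not _IDENTIFIER.fullmatch(collection_name):
        raise ValueError(f"unsafe collection name: {collection_name!r}")
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return (
        f"CREATE TABLE {collection_name} (\n"
        "    id STRING NOT NULL,\n"
        "    content STRING NOT NULL,\n"
        "    metadata JSON,\n"
        f"    vector VECTOR(FLOAT, {dimension}) NOT NULL,\n"
        "    PRIMARY KEY (id)\n"
        ");"
    )
```

For example, `render_table_ddl("docs", 768)` produces a statement whose vector column is declared as `VECTOR(FLOAT, 768)`.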
Full-Text Search Capabilities
Clickzetta supports advanced full-text search with multiple analyzers:
Analyzer Types
- keyword: No tokenization; treats the entire string as a single token
  - Best for: Exact matching, IDs, codes
- english: Designed for English text
  - Features: Recognizes ASCII letters and numbers, converts to lowercase
  - Best for: English content
- chinese: Chinese text tokenizer
  - Features: Recognizes Chinese and English characters, removes punctuation
  - Best for: Chinese or mixed Chinese-English content
- unicode: Multi-language tokenizer based on Unicode
  - Features: Recognizes text boundaries in multiple languages
  - Best for: Multi-language content
Analyzer Modes
- max_word: Fine-grained tokenization (more tokens)
- smart: Intelligent tokenization (balanced)
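The "Best for" guidance above can be turned into a rough selection rule. The sketch below is a simple heuristic for illustration, not something Clickzetta or Dify provides: `suggest_analyzer` is a hypothetical name, and the CJK check covers only the basic CJK Unified Ideographs block, so Japanese kanji would also be routed to the chinese analyzer.

```python
def suggest_analyzer(sample: str, exact_match: bool = False) -> str:
    """Suggest a CLICKZETTA_ANALYZER_TYPE value for a text sample.

    Illustrative heuristic only; review the result against real content.
    """
    if exact_match:
        # IDs and codes should not be tokenized at all.
        return "keyword"
    # Basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
    if any("\u4e00" <= ch <= "\u9fff" for ch in sample):
        return "chinese"
    if sample.isascii():
        return "english"
    # Anything else (accented Latin, Cyrillic, and so on).
    return "unicode"
```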
Full-Text Search Functions
- MATCH_ALL(column, query): All terms must be present
- MATCH_ANY(column, query): At least one term must be present
- MATCH_PHRASE(column, query): Exact phrase matching
- MATCH_PHRASE_PREFIX(column, query): Phrase prefix matching
- MATCH_REGEXP(column, pattern): Regular expression matching
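As a rough illustration of how these functions appear in a WHERE clause, the sketch below builds the predicate text for one of the five functions listed above. The helper `build_match_clause` is hypothetical, and escaping single quotes by doubling them is a common SQL convention that is assumed here rather than confirmed for Clickzetta; parameter binding is preferable where the client supports it.

```python
# Function names are taken from the list above.
MATCH_FUNCTIONS = {
    "MATCH_ALL",
    "MATCH_ANY",
    "MATCH_PHRASE",
    "MATCH_PHRASE_PREFIX",
    "MATCH_REGEXP",
}


def build_match_clause(function: str, column: str, query: str) -> str:
    """Return a predicate such as MATCH_ANY(content, 'some words').

    Hypothetical helper. Quote doubling is an assumed escaping rule.
    """
    if function not in MATCH_FUNCTIONS:
        raise ValueError(f"unknown full-text function: {function}")
    escaped = query.replace("'", "''")
    return f"{function}({column}, '{escaped}')"
```

For example, `build_match_clause("MATCH_ANY", "content", "vector search")` yields a predicate that matches rows containing either term.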
Performance Optimization
Vector Search
- Adjust the exploration factor for the accuracy vs. speed trade-off:
  SET cz.vector.index.search.ef=64;
- Use an appropriate distance function:
  - cosine_distance: Best for normalized embeddings (e.g., from language models)
  - l2_distance: Best for raw feature vectors
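The difference between the two distance functions is easiest to see numerically. The sketch below uses the textbook definitions (Euclidean distance, and one minus cosine similarity); that Clickzetta's l2_distance and cosine_distance follow exactly these definitions is an assumption. It shows that cosine distance ignores vector magnitude, and that for unit-length vectors the squared L2 distance equals twice the cosine distance, so both produce the same ranking.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity (textbook definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Same direction, different magnitude: cosine distance is ~0, L2 is not.
a, b = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
print(cosine_distance(a, b), l2_distance(a, b))

# For unit vectors, l2_distance squared equals 2 * cosine_distance.
u, v = normalize([3.0, 4.0]), normalize([4.0, 3.0])
print(l2_distance(u, v) ** 2, 2 * cosine_distance(u, v))
```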
Full-Text Search
- Choose the right analyzer:
  - Use keyword for exact matching
  - Use language-specific analyzers for better tokenization
- Combine with vector search:
  - Pre-filter with full-text search for better performance
  - Use hybrid search for improved relevance
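One generic way to combine a vector result list with a full-text result list is reciprocal rank fusion (RRF). This document does not say how the integration merges results, so the sketch below is a stand-in technique chosen for illustration, not a description of the integration's behavior. It works on ranks alone, which suits the full-text side since no relevance scores are available. The function name and the constant k=60 (a commonly used value) are assumptions.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs using reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # best match first, from vector search
text_hits = ["b", "d", "a"]    # best match first, from full-text search
print(reciprocal_rank_fusion([vector_hits, text_hits]))
# prints ['b', 'a', 'd', 'c']
```

Documents found by both searches ("a" and "b") rise to the top, while documents found by only one search keep their relative order below them.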
Troubleshooting
Connection Issues
- Verify all 7 required configuration parameters are set
- Check network connectivity to Clickzetta service
- Ensure the user has proper permissions on the schema
Search Performance
- Verify that the vector index exists:
  SHOW INDEX FROM <schema>.<table_name>;
- Check whether the vector index is being used:
  EXPLAIN SELECT ... WHERE l2_distance(...) < threshold;
  Look for vector_index_search_type in the execution plan.
Full-Text Search Not Working
- Verify that the inverted index has been created
- Check that the analyzer configuration matches your content language
- Use the TOKENIZE() function to test tokenization:
  SELECT TOKENIZE('your text', map('analyzer', 'chinese', 'mode', 'smart'));
Limitations
- Vector operations don't support ORDER BY or GROUP BY directly on vector columns
- Full-text search relevance scores are not provided by Clickzetta
- Inverted index creation may fail for very large existing tables; in that case the integration continues without raising an error
- Index naming constraints:
  - Index names must be unique within a schema
  - Only one vector index can be created per column at a time
  - The implementation uses timestamps to ensure unique index names
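The timestamp approach to unique index names can be sketched as follows. The exact name format the implementation uses is not given in this document, so the pattern below (the idx_ prefix, the collection name, an index kind, and a millisecond timestamp) and the helper name `unique_index_name` are hypothetical.

```python
import time


def unique_index_name(collection_name: str, kind: str, now=None) -> str:
    """Build an index name with a millisecond timestamp suffix.

    Hypothetical format for illustration. The now argument (seconds since
    the epoch) exists so the result can be made deterministic in tests.
    """
    seconds = time.time() if now is None else now
    return f"idx_{collection_name}_{kind}_{int(seconds * 1000)}"
```

Because index names must be unique within a schema, a suffix like this lets an index be recreated without colliding with a name used earlier.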