Vectors in SAP HANA Cloud
Vectors are geometric objects defined by a direction and a magnitude. Numerically, vectors can be represented as a sequence of numbers, such as [0.2, 0.8, -0.4, 0.6, ...].
The number of elements in a vector determines its dimensionality. Each number in the sequence corresponds to a specific dimension and contributes to the overall representation of the data point. That said, the actual numbers within the vector are not meaningful on their own. The relative values and relationships between the numbers capture the semantic information and allow algorithms to process and analyze the data effectively.
Vector embeddings are vectors produced by embedding models, which are specialized neural networks designed to convert objects into vector representations. The type of objects that can be vectorized and the method of vectorization depend on the specific embedding model used. The primary goal of an embedding model is to map similar objects to similar vectors.
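To make "similar objects map to similar vectors" concrete, the following minimal sketch compares vectors with cosine similarity, one common similarity measure. The three 4-dimensional vectors and their names are invented for illustration and were not produced by any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (assumed values, for illustration only).
cat = [0.2, 0.8, -0.4, 0.6]
kitten = [0.25, 0.75, -0.35, 0.55]
invoice = [-0.7, 0.1, 0.9, -0.2]

sim_close = cosine_similarity(cat, kitten)   # vectors pointing in nearly the same direction
sim_far = cosine_similarity(cat, invoice)    # vectors pointing in different directions
```

A value near 1 means the two vectors point in almost the same direction. Under the assumption that the embedding model works as intended, this corresponds to semantically similar objects.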
The SAP HANA Cloud vector engine stores and analyzes vector embeddings of complex, unstructured data in a format that can be seamlessly processed, compared, and used to build intelligent data applications and to add more context in GenAI scenarios.
SAP HANA Cloud vector engine supports:
Storage of vectors alongside other enterprise data within the same SAP HANA Cloud database.
Utilization of SQL to interact with all types of data, including vectors.
Use of CRUD (Create, Read, Update, Delete) operations on vectors using SQL.
Integration of spatial, graph, JSON, and custom SQLScript with vector-based queries.
Implementation of vector use cases in solutions through SAP HANA Cloud clients (such as Python), the Python Machine Learning Client for SAP HANA (hana-ml), and the SAP Cloud Application Programming Model (CAP).
To store vector data, SAP HANA Cloud includes the built-in vector data type REAL_VECTOR, which consists of REAL elements (IEEE 754 single-precision floating-point). The dimensionality of a REAL_VECTOR column can range from 1 to 65,000.
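Because the elements are single-precision, values computed in double precision (the default for Python floats) are rounded when stored. The sketch below illustrates this on the client side with the standard struct module. The helper name to_single_precision is hypothetical and is not part of any SAP client. It only mimics the rounding and the dimensionality bounds described above.

```python
import struct

MAX_DIMENSION = 65000  # upper bound stated in the text above

def to_single_precision(values):
    """Round each double-precision float to the nearest IEEE 754 single-precision value."""
    if not 1 <= len(values) <= MAX_DIMENSION:
        raise ValueError(f"dimensionality must be between 1 and {MAX_DIMENSION}")
    fmt = f"<{len(values)}f"  # little-endian sequence of 4-byte floats
    return list(struct.unpack(fmt, struct.pack(fmt, *values)))

original = [0.2, 0.8, -0.4, 0.5]
stored = to_single_precision(original)

# 0.5 is exactly representable in single precision; 0.2 is not and changes slightly.
rounding_error = abs(stored[0] - original[0])
```

The rounding error is tiny, but it explains why a value read back from a single-precision column may differ in its last digits from the value the embedding model returned.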
The data type REAL_VECTOR can be used like other SAP HANA SQL data types, but be aware of some current limitations:
Ordering: No order is defined on REAL_VECTOR. Operations relying on ordering (for example grouping, ordering, comparison) cannot be applied to vectors.
Arithmetic: REAL_VECTOR cannot be used in arithmetic expressions. For example, vector addition is not supported.
Note
Check the roadmap to see updates regarding vector functions: SAP Road Map Explorer - Support for additional vector functions
Table Types: REAL_VECTOR columns are not supported in row tables.
Partitioning: REAL_VECTOR columns cannot be used as partitioning keys when partitioning tables.
Vectorization is the process of taking different types of data (for example unstructured data like text or images) and converting them into numerical vectors.
Note
While vectorization is commonly associated with unstructured data like text or images, it can also be useful for structured data. For example, many machine learning models (such as neural networks or gradient-boosting models) work best with vectorized input.
Let's explore some examples of vectorization:
- Text vectorization
- Image vectorization
Text Vectorization
Text vectorization is the process of turning words and documents into mathematical representations. Textual data, comprising words, sentences, or documents, is inherently qualitative and unstructured. Vectorization algorithms transform this text into numerical vectors by encoding various linguistic features such as word frequency, word context, or word relationships. Some examples of techniques in text vectorization include:
BoW (bag-of-words): represents each document as a vector by treating it as a "bag" of words in which the order and structure of the words do not matter. Instead, the focus is on the presence or frequency of individual words.
TF-IDF (Term Frequency-Inverse Document Frequency): in contrast to simple word frequency, TF-IDF weights each term by how often it occurs in a document and by how rare it is across the whole collection, so that the most meaningful terms are emphasized.
BM25: improves on TF-IDF by taking the length of the document into account and by dampening the effect of many occurrences of a word in a document.
BERT (Bidirectional Encoder Representations from Transformers): a transformer-based model. Transformers are a type of neural network architecture that has become a cornerstone of Natural Language Processing (NLP). BERT considers context in both directions, that is, what comes before and after a word (called bidirectional context understanding). Essentially, it looks at the whole text simultaneously to grasp deeper meaning and context.
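The two simplest techniques in the list above, bag-of-words and TF-IDF, can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes whitespace tokenization, non-empty documents, and the basic weighting tf * log(N / df). Libraries typically use smoothed or normalized variants. The three sample documents are invented.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(docs):
    """Sorted list of all distinct tokens; its length is the vector dimensionality."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def bow_vector(doc, vocab):
    """Bag-of-words: one count per vocabulary term, word order ignored."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

def tfidf_vectors(docs, vocab):
    """TF-IDF: relative term frequency times log(N / document frequency)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            (counts[t] / len(toks)) * math.log(n_docs / doc_freq[t])
            for t in vocab
        ])
    return vectors

docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
vocab = build_vocabulary(docs)
bow = [bow_vector(doc, vocab) for doc in docs]
tfidf = tfidf_vectors(docs, vocab)
```

In this sample, "the" occurs in every document, so its TF-IDF weight is zero everywhere, while "chased" occurs in only one document and receives the largest weight there. The bag-of-words vectors, by contrast, give "the" the highest count in the third document.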
Image Vectorization
Images can be represented as vectors, either by raw pixel values (for simpler tasks) or by more complex features extracted by models like convolutional neural networks (CNNs). These vectors encode the key features of the image, for tasks like image classification or similarity search.
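The raw-pixel approach can be sketched as follows. This minimal example assumes a grayscale image given as a list of rows with 8-bit pixel values. The function name flatten_image and the 3x3 sample image are hypothetical. Feature extraction with a CNN is not shown, since it requires a trained model.

```python
def flatten_image(pixels, max_value=255):
    """Flatten a 2D grayscale image row by row into a 1D vector scaled to [0, 1]."""
    width = len(pixels[0])
    if any(len(row) != width for row in pixels):
        raise ValueError("all rows must have the same width")
    return [value / max_value for row in pixels for value in row]

# A hypothetical 3x3 image showing a bright cross on a dark background.
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
vector = flatten_image(image)  # dimensionality = height * width
```

The dimensionality of such a vector grows with the image resolution, and small shifts of the image change many elements at once. This is one reason why model-extracted features are usually preferred for similarity search.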
Note
A convolutional neural network (CNN) is a network architecture for deep learning that learns directly from data. CNNs are particularly useful for finding patterns in images to recognize objects, classes, and categories.
Image vectorization can be useful for reverse product search. In this case, you can identify a product by its image, if you do not know the name or bar code. Another use case could be finding similar products.