Using the SAP HANA Cloud vector engine in data modeling

Objective

After completing this lesson, you will be able to identify vector data

Vectors and Vectorization

Vectors in SAP HANA Cloud

Vectors are geometric objects defined by a direction and a magnitude. Numerically, vectors can be represented as a sequence of numbers, such as [0.2, 0.8, -0.4, 0.6, ...].

The number of elements in a vector determines its dimensionality. Each number in the sequence corresponds to a specific dimension and contributes to the overall representation of the data point. That said, the actual numbers within the vector are not meaningful on their own. The relative values and relationships between the numbers capture the semantic information and allow algorithms to process and analyze the data effectively.

Vector embeddings are vectors produced by embedding models, which are specialized neural networks designed to convert objects into vector representations. The type of objects that can be vectorized and the method of vectorization depend on the specific embedding model used. The primary goal of an embedding model is to map similar objects to similar vectors.

SAP HANA Cloud vector engine facilitates the storage and analysis of complex and unstructured vector data (embeddings) into a format that can be seamlessly processed, compared, and used in building various intelligent data applications and adding more context in case of GenAI scenarios.

SAP HANA Cloud vector engine supports:

  • Storage of vectors alongside other enterprise data within the same SAP HANA Cloud database.

  • Utilization of SQL to interact with all types of data, including vectors.

  • Use of CRUD (Create, Read, Update, Delete) operations on vectors using SQL.

  • Integration of spatial, graph, JSON, and custom SQLScript with vector-based queries.

  • Implementation of vector use cases in solutions through SAP HANA Cloud clients (such as Python), the Python Machine Learning Client for SAP HANA (hana-ml), and the SAP Cloud Application Programming Model (CAP).

To store vector data, SAP HANA Cloud includes the built-in vector data type REAL_VECTOR, which consists of REAL elements (IEEE 754 single-precision floating-point). The dimensionality of a REAL_VECTOR column ranges from 1 to 65.000.

The data type REAL_VECTOR can be used like other SAP HANA SQL data types, but be aware of some current limitations:

  • Ordering: No order is defined on REAL_VECTOR. Operations relying on ordering (for example grouping, ordering, comparison) cannot be applied to vectors.

  • Arithmetic: REAL_VECTOR cannot be used in arithmetic expressions. For example, vector addition is not supported.

    Note

    Check the roadmap to see updates regarding vector functions:

    SAP Road Map Explorer - Support for additional vector functions

  • Table Types: REAL_VECTOR columns are not supported in row tables.

  • Partitioning: REAL_VECTOR columns cannot be used as partitioning keys when partitioning tables.

Vectorization is the process of taking different types of data (for example unstructured data like text or images) and converting them into numerical vectors.

Note

While vectorization is commonly associated with unstructured data like text or images, it can also be useful for structured data. For example, many machine learning models (such as neural networks or gradient-boosting models) work best with vectorized input.

Let's explore some examples of vectorization:

  • Text vectorization
  • Image vectorization

Text Vectorization

Text vectorization is the process of turning words and documents into mathematical representations. Textual data, comprising words, sentences, or documents, is inherently qualitative and unstructured. Vectorization algorithms transform this text into numerical vectors by encoding various linguistic features such as word frequency, word context, or word relationships. Some examples of techniques in text vectorization include:

  • BoW (bag-of-words): represents documents as vectors, where it treats a document as a "bag" of words, where the order or structure of the words doesn’t matter. Instead, the focus is on the presence or frequency of individual words.

  • TF-IDF (Term Frequency-Inverse Document Frequency): in contrast to simple word frequency, TF-IDF balances common and rare words so that the most meaningful terms are emphasized.

  • BM25: is an improvement over TF-IDF, in that it considers the length of the document and also dampens the effect of having many occurrences of a word in a document.

  • BERT (Bidirectional Encoder Representations from Transformers): is an example of a transformer. Transformers are a type of neural network architecture that have become a cornerstone in Natural Language Processing (NLP). It works to understand context in both directions – what comes before and after a word (called bidirectional context understanding). Essentially, it looks at the whole text simultaneously to grasp deeper meanings and context from the text.

Image Vectorization

Images can be represented as vectors, either by raw pixel values (for simpler tasks) or by more complex features extracted by models like convolutional neural networks (CNNs). These vectors encode the key features of the image, for tasks like image classification or similarity search.

Note

A convolutional neural network (CNN) is a network architecture for deep learning that learns directly from data. CNNs are particularly useful for finding patterns in images to recognize objects, classes, and categories.

Image vectorization can be useful for reverse product search. In this case, you can identify a product by its image, if you do not know the name or bar code. Another use case could be finding similar products.

Vectorization in SAP HANA Cloud

Let's explore vectorization in SAP HANA Cloud.

The VECTOR_EMBEDDING Function

SAP HANA Cloud offers the VECTOR_EMBEDDING function to create a vector embedding from a text. This function requires that Natural Language Processing (NLP) is enabled in SAP HANA Cloud.

The VECTOR_EMBEDDING function uses the following arguments:

  • The first argument is that the text has to be embedded.

  • The second argument specifies the type of the text.

    • 'DOCUMENT': For texts that are stored in the database and that must be searchable.

    • 'QUERY': For texts that are used in queries.

  • The third argument specifies which model and which version of the model are used. For example a model, which is based on transformer structure SAP_NEB.20240715.

    Note

    A list of available models and versions is available in VECTOR_EMBEDDING Function (Vector).

For example, the following query returns a vector embedding for the text "Hello World", using SAP_NEB.20240715:

Code Snippet
1
SELECT VECTOR_EMBEDDING('Hello world!', 'DOCUMENT', 'SAP_NEB.20240715') FROM DUMMY;
Screenshot of a database query interface showing the result of a vector embedding query for Hello world!. JSON view displays a list of numerical values as the embedding result for the input text.

When you double-click on the result row, all details of the vector are shown, which are many dimensions in this example.

The TO_REAL_VECTOR Function

In SAP HANA Cloud, you can use the TO_REAL_VECTOR function to construct vectors from:

  • A textual representation. For example:

    Code Snippet
    1
    SELECT TO_REAL_VECTOR('[1,2,3]') FROM DUMMY;
  • A binary representation. For example:

    Code Snippet
    1
    SELECT TO_REAL_VECTOR(x'030000000000803F0000004000004040') FROM DUMMY;
  • An array consisting of numerical elements. For example:

    Code Snippet
    1
    SELECT TO_REAL_VECTOR(ARRAY(1,2,3)) FROM DUMMY;
Three SQL query results using TO_REAL_VECTOR from DUMMY. The outputs are [1, 2, 3] in each case, with different data input formats: '(1,2,3)', hexadecimal and ARRAY(1,2,3).

It's also possible to serialize vectors. A vector can be serialized to the following:

  • A textual representation with TO_NVARCHAR or TO_NCLOB

  • A binary representation with TO_VARBINARY or TO_BLOB

  • An array consisting of REAL elements with TO_ARRAY

SQL query results showing conversion between data types using functions TO_NVARCHAR, TO_ARRAY, and TO_REAL_VECTOR on the input array [1,2,3]. Output shows different transformations.

Vector Data Storage in SAP HANA Cloud

Let's look at how vector data can be stored and maintained in SAP HANA Cloud.

Creating Tables with REAL_VECTOR Columns

Tables containing columns with the data type REAL_VECTOR are created similarly to regular tables.

The following example creates a table T1 with a column VECTOR_COLUMN (data type REAL_VECTOR), which can accept vectors of any dimension:

Code Snippet
1
CREATE TABLE T1 ( id INT PRIMARY KEY, VECTOR_COLUMN REAL_VECTOR );

For columns with data type REAL_VECTOR, you can specify an optional dimension constraint. This constraint ensures that only vectors with the specified dimension can be stored in these columns.

The following example creates a table T2 with a column VECTOR_COLUMN (data type REAL_VECTOR), that is constrained to accept only vectors with a dimension of 3:

Code Snippet
1
CREATE TABLE T2 ( id INT PRIMARY KEY, VECTOR_COLUMN REAL_VECTOR(3) );

Inserting Vectors

Vectors can be inserted in a manner similar to any other data type. However, it’s important to note that there are no implicit conversions to the REAL_VECTOR data type. Therefore, a vector must be explicitly constructed before it can be inserted into the database.

You can use import tools to upload vector embeddings into SAP HANA Cloud. The data type REAL_VECTOR supports importing and exporting using data files. The following data formats are supported:

  • CSV
  • Parquet

You can also use SQL commands in SAP HANA Cloud to construct and insert vector data.

Here is an example that demonstrates how to construct and insert three vectors into table T1. Notice that the TO_REAL_VECTOR function is used to construct the vector based on an array representation:

Code Snippet
123
INSERT INTO T1 (ID, VECTOR_COLUMN) VALUES (1, TO_REAL_VECTOR('[0, 40, 80]')); INSERT INTO T1 (ID, VECTOR_COLUMN) VALUES (2, TO_REAL_VECTOR('[20, 60, 80]')); INSERT INTO T1 (ID, VECTOR_COLUMN) VALUES (3, TO_REAL_VECTOR('[42, 73, 99]'));

Note

If many vectors need to be inserted into a column, it is recommended to use parameterized statements combined with batch inserts. For example, you can use Python. For information about using the hdbcli Python module, see Python Application Programming:

Python Application Programming

When texts are inserted in a table, the VECTOR_EMBEDDING function can be used to automatically generate embeddings. Let's have a look at the available options:

  • Create embeddings with generated columns.

  • Create embeddings using triggers.

Create Embeddings with Generated Columns

Let's look at the following example, where a table T1 is created, including column VECTOR_COLUMN (data type REAL_VECTOR):

Code Snippet
123456
CREATE TABLE T1 ( TITLE NVARCHAR(100), PARAGRAPH_ID INTEGER, PARAGRAPH NVARCHAR(5000), VECTOR_COLUMN REAL_VECTOR GENERATED ALWAYS AS VECTOR_EMBEDDING(PARAGRAPH, 'DOCUMENT', 'SAP_NEB.20240715') );

When data is inserted in this table, the values for column VECTOR_COLUMN are updated by the automatically generated (GENERATED ALWAYS AS) vector embeddings based on the values of column PARAGRAPH (NVARCHAR (5000)).

If the model you are using has a rather small token limit, then NVARCHAR(5000) columns should be sufficient.

If your texts are stored in NCLOB columns, embeddings have to be created using triggers.

Create Embeddings Using Triggers

Let's look at the following example, where a table T2 is created, including column VECTOR_COLUMN (data type REAL_VECTOR):

Code Snippet
123456
CREATE TABLE T2 ( TITLE NVARCHAR(100), PARAGRAPH_ID INTEGER, PARAGRAPH NCLOB, VECTOR_COLUMN REAL_VECTOR );

Let's now see how the triggers are defined:

Code Snippet
12345678
CREATE TRIGGER CREATE_MY_VECTOR_COLUMN BEFORE INSERT OR UPDATE OF PARAGRAPH ON T2 REFERENCING NEW ROW AS newrow ONLINE BEGIN newrow.VECTOR_COLUMN = VECTOR_EMBEDDING( :newrow.PARAGRAPH, 'DOCUMENT', 'SAP_NEB.20240715'); END;

The insertion of a new row triggers the vector embedding function and the values for column VECTOR_COLUMN are updated by the automatically generated vector embeddings based on the values of column PARAGRAPH (NCLOB).

Update and Delete Vectors

The following example shows how to update the vector with ID 1 in table T3:

Code Snippet
1
UPDATE T3 SET VECTOR_COLUMN = TO_REAL_VECTOR('[17,19,23]') WHERE ID = 1;

The following example shows how to delete the vector with ID 3 in table T3:

Code Snippet
1
DELETE FROM T3 WHERE ID = 3;

Memory Consumption and Sizing Considerations

Memory consumption and sizing are influenced by the dimension of the vector. The amount of memory in bytes, ( m ), consumed by a single vector ( v ) with dimension ( dim(v) ) can be computed using the following formula:

The image show the formula to calculate the memory consumption.

For instance, a vector with 1536 dimensions requires 8 + (4 * 1536) = 6152 bytes.

To compute the memory needed by a column, multiply the memory usage of a single vector by the number of rows in the column. For example, storing 1000000 vectors with 1536 dimensions requires (1000000 * 6152) = 6152000000 bytes, which is roughly 5,73 gigabytes of memory.

As a rule of thumb, the total memory of an SAP HANA system should be at least twice the memory occupied by the data.