Using the SAP HANA Cloud vector engine in data modeling

Objectives

After completing this lesson, you will be able to:
  • Identify vector data
  • Describe analysis with SAP HANA Cloud vector engine

Vectors and Vectorization

Vectors in SAP HANA Cloud

Vectors are geometric objects defined by a direction and a magnitude. Numerically, vectors can be represented as a sequence of numbers, such as [0.2, 0.8, -0.4, 0.6, ...].

The number of elements in a vector determines its dimensionality. Each number in the sequence corresponds to a specific dimension and contributes to the overall representation of the data point. That said, the actual numbers within the vector are not meaningful on their own. The relative values and relationships between the numbers capture the semantic information and allow algorithms to process and analyze the data effectively.

Vector embeddings are vectors produced by embedding models, which are specialized neural networks designed to convert objects into vector representations. The type of objects that can be vectorized and the method of vectorization depend on the specific embedding model used. The primary goal of an embedding model is to map similar objects to similar vectors.

SAP HANA Cloud vector engine facilitates the storage and analysis of complex and unstructured vector data (embeddings) into a format that can be seamlessly processed, compared, and used in building various intelligent data applications and adding more context in case of GenAI scenarios.

SAP HANA Cloud vector engine supports:

  • Storage of vectors alongside other enterprise data within the same SAP HANA Cloud database.

  • Utilization of SQL to interact with all types of data, including vectors.

  • Use of CRUD (Create, Read, Update, Delete) operations on vectors using SQL.

  • Integration of spatial, graph, JSON, and custom SQLScript with vector-based queries.

  • Implementation of vector use cases in solutions through SAP HANA Cloud clients (such as Python), the Python Machine Learning Client for SAP HANA (hana-ml), and the SAP Cloud Application Programming Model (CAP).

To store vector data, SAP HANA Cloud includes the built-in vector data type REAL_VECTOR, which consists of REAL elements (IEEE 754 single-precision floating-point). The dimensionality of a REAL_VECTOR column ranges from 1 to 65.000.

The data type REAL_VECTOR can be used like other SAP HANA SQL data types, but be aware of some current limitations:

  • Ordering: No order is defined on REAL_VECTOR. Operations relying on ordering (for example grouping, ordering, comparison) cannot be applied to vectors.

  • Arithmetic: REAL_VECTOR cannot be used in arithmetic expressions. For example, vector addition is not supported.

    Note

    Check the roadmap to see updates regarding vector functions:

    SAP Road Map Explorer - Support for additional vector functions

  • Table Types: REAL_VECTOR columns are not supported in row tables.

  • Partitioning: REAL_VECTOR columns cannot be used as partitioning keys when partitioning tables.

Vectorization is the process of taking different types of data (for example unstructured data like text or images) and converting them into numerical vectors.

Note

While vectorization is commonly associated with unstructured data like text or images, it can also be useful for structured data. For example, many machine learning models (such as neural networks or gradient-boosting models) work best with vectorized input.

Let's explore some examples of vectorization:

  • Text vectorization
  • Image vectorization

Text Vectorization

Text vectorization is the process of turning words and documents into mathematical representations. Textual data, comprising words, sentences, or documents, is inherently qualitative and unstructured. Vectorization algorithms transform this text into numerical vectors by encoding various linguistic features such as word frequency, word context, or word relationships. Some examples of techniques in text vectorization include:

  • BoW (bag-of-words): represents documents as vectors, where it treats a document as a "bag" of words, where the order or structure of the words doesn’t matter. Instead, the focus is on the presence or frequency of individual words.

  • TF-IDF (Term Frequency-Inverse Document Frequency): in contrast to simple word frequency, TF-IDF balances common and rare words so that the most meaningful terms are emphasized.

  • BM25: is an improvement over TF-IDF, in that it considers the length of the document and also dampens the effect of having many occurrences of a word in a document.

  • BERT (Bidirectional Encoder Representations from Transformers): is an example of a transformer. Transformers are a type of neural network architecture that have become a cornerstone in Natural Language Processing (NLP). It works to understand context in both directions – what comes before and after a word (called bidirectional context understanding). Essentially, it looks at the whole text simultaneously to grasp deeper meanings and context from the text.

Image Vectorization

Images can be represented as vectors, either by raw pixel values (for simpler tasks) or by more complex features extracted by models like convolutional neural networks (CNNs). These vectors encode the key features of the image, for tasks like image classification or similarity search.

Note

A convolutional neural network (CNN) is a network architecture for deep learning that learns directly from data. CNNs are particularly useful for finding patterns in images to recognize objects, classes, and categories.

Image vectorization can be useful for reverse product search. In this case, you can identify a product by its image, if you do not know the name or bar code. Another use case could be finding similar products.

Vectorization in SAP HANA Cloud

Let's explore vectorization in SAP HANA Cloud.

The VECTOR_EMBEDDING Function

SAP HANA Cloud offers the VECTOR_EMBEDDING function to create a vector embedding from a text. This function requires that Natural Language Processing (NLP) is enabled in SAP HANA Cloud.

The VECTOR_EMBEDDING function uses the following arguments:

  • The first argument is that the text has to be embedded.

  • The second argument specifies the type of the text.

    • 'DOCUMENT': For texts that are stored in the database and that must be searchable.

    • 'QUERY': For texts that are used in queries.

  • The third argument specifies which model and which version of the model are used. For example a model, which is based on transformer structure SAP_NEB.20240715.

    Note

    A list of available models and versions is available in VECTOR_EMBEDDING Function (Vector).

For example, the following query returns a vector embedding for the text "Hello World", using SAP_NEB.20240715:

Code Snippet
1
SELECT VECTOR_EMBEDDING('Hello world!', 'DOCUMENT', 'SAP_NEB.20240715') FROM DUMMY;
Screenshot of a database query interface showing the result of a vector embedding query for Hello world!. JSON view displays a list of numerical values as the embedding result for the input text.

When you double-click on the result row, all details of the vector are shown, which are many dimensions in this example.

The TO_REAL_VECTOR Function

In SAP HANA Cloud, you can use the TO_REAL_VECTOR function to construct vectors from:

  • A textual representation. For example:

    Code Snippet
    1
    SELECT TO_REAL_VECTOR('[1,2,3]') FROM DUMMY;
  • A binary representation. For example:

    Code Snippet
    1
    SELECT TO_REAL_VECTOR(x'030000000000803F0000004000004040') FROM DUMMY;
  • An array consisting of numerical elements. For example:

    Code Snippet
    1
    SELECT TO_REAL_VECTOR(ARRAY(1,2,3)) FROM DUMMY;
Three SQL query results using TO_REAL_VECTOR from DUMMY. The outputs are [1, 2, 3] in each case, with different data input formats: '(1,2,3)', hexadecimal and ARRAY(1,2,3).

It's also possible to serialize vectors. A vector can be serialized to the following:

  • A textual representation with TO_NVARCHAR or TO_NCLOB

  • A binary representation with TO_VARBINARY or TO_BLOB

  • An array consisting of REAL elements with TO_ARRAY

SQL query results showing conversion between data types using functions TO_NVARCHAR, TO_ARRAY, and TO_REAL_VECTOR on the input array [1,2,3]. Output shows different transformations.

Vector Data Storage in SAP HANA Cloud

Let's look at how vector data can be stored and maintained in SAP HANA Cloud.

Creating Tables with REAL_VECTOR Columns

Tables containing columns with the data type REAL_VECTOR are created similarly to regular tables.

The following example creates a table T1 with a column VECTOR_COLUMN (data type REAL_VECTOR), which can accept vectors of any dimension:

Code Snippet
1
CREATE TABLE T1 ( id INT PRIMARY KEY, VECTOR_COLUMN REAL_VECTOR );

For columns with data type REAL_VECTOR, you can specify an optional dimension constraint. This constraint ensures that only vectors with the specified dimension can be stored in these columns.

The following example creates a table T2 with a column VECTOR_COLUMN (data type REAL_VECTOR), that is constrained to accept only vectors with a dimension of 3:

Code Snippet
1
CREATE TABLE T2 ( id INT PRIMARY KEY, VECTOR_COLUMN REAL_VECTOR(3) );

Inserting Vectors

Vectors can be inserted in a manner similar to any other data type. However, it’s important to note that there are no implicit conversions to the REAL_VECTOR data type. Therefore, a vector must be explicitly constructed before it can be inserted into the database.

You can use import tools to upload vector embeddings into SAP HANA Cloud. The data type REAL_VECTOR supports importing and exporting using data files. The following data formats are supported:

  • CSV
  • Parquet

You can also use SQL commands in SAP HANA Cloud to construct and insert vector data.

Here is an example that demonstrates how to construct and insert three vectors into table T1. Notice that the TO_REAL_VECTOR function is used to construct the vector based on an array representation:

Code Snippet
123
INSERT INTO T1 (ID, VECTOR_COLUMN) VALUES (1, TO_REAL_VECTOR('[0, 40, 80]')); INSERT INTO T1 (ID, VECTOR_COLUMN) VALUES (2, TO_REAL_VECTOR('[20, 60, 80]')); INSERT INTO T1 (ID, VECTOR_COLUMN) VALUES (3, TO_REAL_VECTOR('[42, 73, 99]'));

Note

If many vectors need to be inserted into a column, it is recommended to use parameterized statements combined with batch inserts. For example, you can use Python. For information about using the hdbcli Python module, see Python Application Programming:

Python Application Programming

When texts are inserted in a table, the VECTOR_EMBEDDING function can be used to automatically generate embeddings. Let's have a look at the available options:

  • Create embeddings with generated columns.

  • Create embeddings using triggers.

Create Embeddings with Generated Columns

Let's look at the following example, where a table T1 is created, including column VECTOR_COLUMN (data type REAL_VECTOR):

Code Snippet
123456
CREATE TABLE T1 ( TITLE NVARCHAR(100), PARAGRAPH_ID INTEGER, PARAGRAPH NVARCHAR(5000), VECTOR_COLUMN REAL_VECTOR GENERATED ALWAYS AS VECTOR_EMBEDDING(PARAGRAPH, 'DOCUMENT', 'SAP_NEB.20240715') );

When data is inserted in this table, the values for column VECTOR_COLUMN are updated by the automatically generated (GENERATED ALWAYS AS) vector embeddings based on the values of column PARAGRAPH (NVARCHAR (5000)).

If the model you are using has a rather small token limit, then NVARCHAR(5000) columns should be sufficient.

If your texts are stored in NCLOB columns, embeddings have to be created using triggers.

Create Embeddings Using Triggers

Let's look at the following example, where a table T2 is created, including column VECTOR_COLUMN (data type REAL_VECTOR):

Code Snippet
123456
CREATE TABLE T2 ( TITLE NVARCHAR(100), PARAGRAPH_ID INTEGER, PARAGRAPH NCLOB, VECTOR_COLUMN REAL_VECTOR );

Let's now see how the triggers are defined:

Code Snippet
12345678
CREATE TRIGGER CREATE_MY_VECTOR_COLUMN BEFORE INSERT OR UPDATE OF PARAGRAPH ON T2 REFERENCING NEW ROW AS newrow ONLINE BEGIN newrow.VECTOR_COLUMN = VECTOR_EMBEDDING( :newrow.PARAGRAPH, 'DOCUMENT', 'SAP_NEB.20240715'); END;

The insertion of a new row triggers the vector embedding function and the values for column VECTOR_COLUMN are updated by the automatically generated vector embeddings based on the values of column PARAGRAPH (NCLOB).

Update and Delete Vectors

The following example shows how to update the vector with ID 1 in table T3:

Code Snippet
1
UPDATE T3 SET VECTOR_COLUMN = TO_REAL_VECTOR('[17,19,23]') WHERE ID = 1;

The following example shows how to delete the vector with ID 3 in table T3:

Code Snippet
1
DELETE FROM T3 WHERE ID = 3;

Memory Consumption and Sizing Considerations

Memory consumption and sizing are influenced by the dimension of the vector. The amount of memory in bytes, ( m ), consumed by a single vector ( v ) with dimension ( dim(v) ) can be computed using the following formula:

For instance, a vector with 1536 dimensions requires 8 + (4 * 1536) = 6152 bytes.

To compute the memory needed by a column, multiply the memory usage of a single vector by the number of rows in the column. For example, storing 1000000 vectors with 1536 dimensions requires (1000000 * 6152) = 6152000000 bytes, which is roughly 5,73 gigabytes of memory.

As a rule of thumb, the total memory of an SAP HANA system should be at least twice the memory occupied by the data.

Use Case Scenarios

Use Cases

Let's look at some use case scenarios of the SAP HANA Cloud vector engine:

  • Semantic search and retrieval: apply vector embeddings to enhance semantic search capabilities, enabling users to find relevant information quickly.

  • Contextual analysis: use vector embeddings for contextual analysis, allowing for a deeper understanding of relationships between different data points within the database.

  • Intelligent data applications: use vector embeddings to build intelligent data applications, unlocking insights and facilitating more informed decision-making.

  • Enhanced recommendations: apply vector embeddings to improve recommendation systems, providing users with more accurate and personalized suggestions.

  • Optimized Large Language Models (LLMs): enhance the output of LLMs by using vector embeddings to optimize and add context to the generated content.

Large Language Models and Retrieval-Augmented Generation

Large Language Models (LLMs) are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.

LLMs models offer amazing opportunities, but they also have limitations. For example, LLMs can rely on outdated training data and lack company-specific data and business process context. An LLM can only work with the initial training data.

Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a Large Language Model, so it references an authoritative knowledge base outside of its training data sources before generating a response. RAG extends the capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

The SAP HANA Cloud vector engine enables businesses to combine the power of LLMs with company-specific, real-time data and business process know-how, all integrated in SAP HANA Cloud. The SAP HANA Cloud vector engine can provide LLMs with relevant data of an organization through RAG.

Similarity Measures

While directly comparing the similarity of two raw objects can be complex and computationally intensive, similarity measures make it straightforward and relatively inexpensive to compare vectors.

A similarity measure is a mathematical function that calculates a single numerical value from two vectors. There are many metrics that you can use to compute and measure similarities between entities of text. SAP HANA Cloud provides two built-in similarity measures for working with vector data:

  • L2 Distance
  • Cosine Similarity

L2 Distance

L2 Distance measures the Euclidean Distance between vectors.

Euclidean Distance is the most common distance measure used to determine the distance between two data points in a multidimensional space. Determining the distance between two data points can be very telling of the relationship they share. The result is a value greater or equal to 0. A relatively short or close distance could imply that the two data points are similar (the smaller the distance, the higher the similarity).

Euclidean distance depends on the vector’s magnitude, which is influenced by occurrence counts and heterogeneity of word neighborhoods.

This measure is particularly useful for applications that involve high-dimensional data, such as machine learning, spatial and geometric computations, data mining, and recommendation systems where similarity or proximity between data points is important.

The syntax of the L2 Distance function is like this:

Code Snippet
1
L2DISTANCE(vector1, vector2)

Note

Both vectors must have the same dimension.

Let's look at the following example:

A 3D graph shows vectors in different colors with coordinates listed in a table. Vector IDs range from 1 to 10, representing points like [1,1,1], [4,4,2], and [6,3,1].

The image shows 10 colored dots in a 3-dimensional space. The coordinates of the dots are shown in the table on the right.

Using the L2 Distance function you can calculate the Euclidean distance between vector 1 and vector 2:

Code Snippet
1
SELECT L2DISTANCE( TO_REAL_VECTOR('[1, 1, 1]'), TO_REAL_VECTOR('[4, 4, 2]')) FROM DUMMY;

The result of this function is 4.358898943540674.

A database query calculates the L2 distance between vectors [1, 1, 1] and [4, 4, 2], yielding the result 4.35889943540674 displayed in the output row.

Cosine Similarity

Cosine Similarity is a metric, helpful in determining how similar the data objects are irrespective of their size. The Cosine Similarity does not depend on the magnitudes of the vectors, but only on their angle (the smaller the angle, the higher the similarity).

It is defined as the cosine of the angle between them, which in turn is calculated as the dot product of the vectors divided by the product of their magnitudes. The result is a value between -1 and 1, where a value of:

  • 1 means that the vectors are identical.

  • 0 means that the vectors are orthogonal (at a 90-degree angle to each other).

  • -1 means that the vectors are diametrically opposed.

This measure is particularly useful in text mining, information retrieval, and machine learning.

The syntax of the Cosine Similarity function is like this:

Code Snippet
1
COSINE_SIMILARITY(vector1, vector2)

Note

Both vectors must have the same dimensionality (number of elements).

Let's look at the following example:

3D plot showing ten vectors in various directions with a table listing corresponding vector IDs and coordinates.

The image shows 10 colored vectors in a 3-dimensional space. The coordinates of these vectors are shown in the table on the right.

Using the Cosine Similarity function you can calculate the similarity between vector 1 and vector 2:

Code Snippet
1
SELECT COSINE_SIMILARITY( TO_REAL_VECTOR('[1,1,1]'), TO_REAL_VECTOR('[4,4,2]')) FROM DUMMY;

The result of this function is 0.9622504486493763.

SQL query calculates cosine similarity between two real vectors [1,1,1] and [4,4,2]. The result displayed is 0.9622504486493763.

Text Similarity Example

Let's now look at a text similarity scenario where the SAP HANA Cloud vector engine can be used.

In this example, the Vector Embedding function and the Cosine Similarity function are used to determine which wireless mouse is the closest alternative to a ‘3M wireless mouse', according to its description.

In SAP HANA database explorer, the following steps are taken:

  1. Display the contents of table PRODUCT_VECTORS to notice the values for columns PRODUCT_ID, PRODUCT_NAME, and DESCRIPTION.

    SQL Statement:

    Code Snippet
    12345
    SELECT "PRODUCT_ID", "PRODUCT_NAME", "DESCRIPTION" FROM "PRODUCT_VECTORS";
  2. Add a column VECTOR_COLUMN (data type REAL_VECTOR) to table PRODUCT_VECTORS.

    SQL Statement:

    Code Snippet
    1
    ALTER TABLE "PRODUCT_VECTORS" ADD (VECTOR_COLUMN REAL_VECTOR);
  3. Create vector embeddings using column DESCRIPTION and update column VECTOR_COLUMN with them.

    SQL Statement:

    Code Snippet
    1
    UPDATE "PRODUCT_VECTORS" SET VECTOR_COLUMN = VECTOR_EMBEDDING( DESCRIPTION, 'DOCUMENT', 'SAP_NEB.20240715' );
  4. Display the contents of table PRODUCT_VECTORS to notice the values for PRODUCT_ID, PRODUCT_NAME, DESCRIPTION, and VECTOR_COLUMN.

    SQL Statement:

    Code Snippet
    123456
    SELECT TOP 1000 "PRODUCT_ID", "PRODUCT_NAME", "DESCRIPTION", "VECTOR_COLUMN" FROM "PRODUCT_VECTORS"
  5. Execute a query to find the top 5 products that are most similar to a ‘3M wireless mouse’ based on their vector representations. Use the Cosine Similarity function and order the results descending to have the best match on top of the list.

    SQL Statement:

    Code Snippet
    1234
    SELECT TOP 5 PRODUCT_ID, PRODUCT_NAME, DESCRIPTION, COSINE_SIMILARITY(VECTOR_EMBEDDING( '3M wireless mouse', 'QUERY', 'SAP_NEB.20240715' ), VECTOR_COLUMN) AS "SIMILARITY" FROM "PRODUCT_VECTORS" ORDER BY "SIMILARITY" DESC;

Watch the following video to see how these steps are performed in the SAP HANA database explorer.

Log in to track your progress & complete quizzes