ngrams to 4-shingles, which are used
as input for creating a set digest of each initial text. The set digests
are compared to each other to get an approximation of the similarity of
their corresponding initial texts:
id1 | id2 | intersection_cardinality | jaccard_index |
---|---|---|---|
1 | 2 | 0 | 0.0 |
1 | 3 | 4 | 0.6 |
2 | 3 | 0 | 0.0 |
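
A comparison of this kind could be sketched as follows. The text_input relation and its three sample sentences are hypothetical and serve only as an illustration; the CTE names are likewise assumptions.

```sql
-- Hypothetical sample texts; any relation with (id, text) columns would do.
WITH text_input(id, text) AS (
    VALUES
        (1, 'The quick brown fox jumps over the lazy dog'),
        (2, 'The quick and the lazy'),
        (3, 'The quick brown fox jumps over the dog')
),
-- Split each text into words, build 4-word shingles, and join each
-- shingle back into a single string.
text_ngrams(id, shingles) AS (
    SELECT id,
           transform(ngrams(split(text, ' '), 4), token -> array_join(token, ' '))
    FROM text_input
),
-- One set digest per text, built from its shingles.
minhash_digest(id, digest) AS (
    SELECT id,
           (SELECT make_set_digest(v) FROM unnest(shingles) u(v))
    FROM text_ngrams
)
-- Compare every pair of texts exactly once.
SELECT m1.id AS id1,
       m2.id AS id2,
       intersection_cardinality(m1.digest, m2.digest) AS intersection_cardinality,
       jaccard_index(m1.digest, m2.digest) AS jaccard_index
FROM minhash_digest m1
JOIN minhash_digest m2 ON m1.id < m2.id
ORDER BY id1, id2;
```

Pairing rows with m1.id < m2.id keeps each unordered pair once and avoids comparing a text with itself.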
The Trino type for a Set Digest data sketch is setdigest. Trino
offers the ability to merge multiple Set Digest data sketches.
Data sketches can be serialized to and deserialized from varbinary.
This allows them to be stored for later use.
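
A minimal sketch of such a round trip, assuming that a plain CAST between setdigest and varbinary is how the serialization is exposed; the sample values and the column alias are illustrative only.

```sql
-- Serialize a set digest to varbinary, then restore it and query it.
SELECT cardinality(CAST(serialized AS setdigest)) AS restored_cardinality
FROM (
    SELECT CAST(make_set_digest(value) AS varbinary) AS serialized
    FROM (VALUES 1, 2, 3) T(value)
) stored;
```

In practice the varbinary value would typically be written to a table column and cast back when the digest is needed again.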
make_set_digest(x) → setdigest
Composes all input values of x into a setdigest.
Create a setdigest corresponding to a bigint array:
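
A minimal sketch for the bigint case, in the same shape as the varchar query further down; the values 1, 2, and 3 are illustrative assumptions.

```sql
-- Aggregate three integer values into a single set digest.
SELECT make_set_digest(value)
FROM (VALUES 1, 2, 3) T(value);
```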
Create a setdigest corresponding to a varchar array:
SELECT make_set_digest(value)
FROM (VALUES 'Trino', 'SQL', 'on', 'everything') T(value);
merge_set_digest(setdigest) → setdigest
Returns the setdigest that is the aggregate union of the individual
Set Digest structures.
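
As an illustration, the following sketch builds one digest per group and then merges them. The grp and value columns and their contents are hypothetical; the merged digest represents the union of {1, 2} and {2, 3}.

```sql
-- Build one set digest per group, then merge the per-group digests.
SELECT cardinality(merge_set_digest(digest)) AS distinct_values
FROM (
    SELECT make_set_digest(value) AS digest
    FROM (VALUES ('a', 1), ('a', 2), ('b', 2), ('b', 3)) T(grp, value)
    GROUP BY grp
) per_group;
```

This pattern is useful when digests are precomputed per partition or per day and combined later.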
cardinality(setdigest) → long
Returns the cardinality of the set digest from its internal
HyperLogLog component.
Examples:
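
A minimal sketch, assuming an illustrative list of integers that contains duplicates:

```sql
-- The input holds 11 rows but only 5 distinct values (1 through 5).
SELECT cardinality(make_set_digest(value))
FROM (VALUES 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5) T(value);
```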
intersection_cardinality(x, y) → long
Returns an estimate of the cardinality of the intersection of the
two set digests. x and y must be of type setdigest.
Examples:
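
A minimal sketch with two illustrative columns, v1 and v2, whose names and values are assumptions:

```sql
-- v1 holds {1, 2, 3} and v2 holds {1, 2, 3, 4}; they share 3 values.
SELECT intersection_cardinality(make_set_digest(v1), make_set_digest(v2))
FROM (VALUES (1, 1), (2, 2), (3, 3), (3, 4)) T(v1, v2);
```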
jaccard_index(x, y) → double
Returns an estimate of the Jaccard index for the two set digests.
x and y must be of type setdigest.
Examples:
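
A minimal sketch with two illustrative columns, v1 and v2, whose names and values are assumptions. The exact Jaccard index of the underlying sets is 2 / 4 = 0.5; the function returns an estimate derived from the MinHash signatures.

```sql
-- v1 holds {1, 2} and v2 holds {1, 2, 3, 4}.
SELECT jaccard_index(make_set_digest(v1), make_set_digest(v2))
FROM (VALUES (1, 1), (2, 2), (1, 3), (2, 4)) T(v1, v2);
```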
hash_counts(x)
→ map(bigint, smallint)
Returns a map containing the Murmur3Hash128 hashed values and the
count of their occurrences within the internal MinHash structure
belonging to x. x must be of type setdigest.
Examples:
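
A minimal sketch, assuming an illustrative input in which the value 1 appears three times and the value 2 appears twice. The resulting map should contain one entry per distinct value, keyed by its hash.

```sql
-- Two distinct values: 1 occurs three times, 2 occurs twice.
SELECT hash_counts(make_set_digest(value))
FROM (VALUES 1, 1, 1, 2, 2) T(value);
```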