mlpack
master
|
The LSHSearch class; this class builds a hash on the reference set and uses this hash to compute the distance-approximate nearest-neighbors of the given queries. More...
Classes | |
struct | CandidateCmp |
Compare two candidates based on the distance. More... | |
Public Member Functions | |
LSHSearch (const arma::mat &referenceSet, const arma::cube &projections, const double hashWidth=0.0, const size_t secondHashSize=99901, const size_t bucketSize=500) | |
This function initializes the LSH class. More... | |
LSHSearch (const arma::mat &referenceSet, const size_t numProj, const size_t numTables, const double hashWidth=0.0, const size_t secondHashSize=99901, const size_t bucketSize=500) | |
This function initializes the LSH class. More... | |
LSHSearch () | |
Create an untrained LSH model. More... | |
LSHSearch (const LSHSearch &other) | |
Copy the given LSH model. More... | |
LSHSearch (LSHSearch &&other) | |
Take ownership of the given LSH model. More... | |
~LSHSearch () | |
Clean memory. More... | |
size_t | BucketSize () const |
Get the bucket size of the second hash. More... | |
size_t | DistanceEvaluations () const |
Return the number of distance evaluations performed. More... | |
size_t & | DistanceEvaluations () |
Modify the number of distance evaluations performed. More... | |
size_t | NumProjections () const |
Get the number of projections. More... | |
const arma::mat & | Offsets () const |
Get the offsets 'b' for each of the projections. (One 'b' per column.) More... | |
LSHSearch & | operator= (const LSHSearch &other) |
Copy the given LSH model. More... | |
LSHSearch & | operator= (LSHSearch &&other) |
Take ownership of the given LSH model. More... | |
const arma::cube & | Projections () |
Get the projection tables. More... | |
void | Projections (const arma::cube &projTables) |
Change the projection tables (this retrains the LSH model). More... | |
const arma::mat & | ReferenceSet () const |
Return the reference dataset. More... | |
void | Search (const arma::mat &querySet, const size_t k, arma::Mat< size_t > &resultingNeighbors, arma::mat &distances, const size_t numTablesToSearch=0, const size_t T=0) |
Compute the nearest neighbors of the points in the given query set and store the output in the given matrices. More... | |
void | Search (const size_t k, arma::Mat< size_t > &resultingNeighbors, arma::mat &distances, const size_t numTablesToSearch=0, size_t T=0) |
Compute the nearest neighbors and store the output in the given matrices. More... | |
const std::vector< arma::Col< size_t > > & | SecondHashTable () const |
Get the second hash table. More... | |
const arma::vec & | SecondHashWeights () const |
Get the weights of the second hash. More... | |
template<typename Archive > | |
void | Serialize (Archive &ar, const unsigned int version) |
Serialize the LSH model. More... | |
void | Train (const arma::mat &referenceSet, const size_t numProj, const size_t numTables, const double hashWidth=0.0, const size_t secondHashSize=99901, const size_t bucketSize=500, const arma::cube &projection=arma::cube()) |
Train the LSH model on the given dataset. More... | |
Static Public Member Functions | |
static double | ComputeRecall (const arma::Mat< size_t > &foundNeighbors, const arma::Mat< size_t > &realNeighbors) |
Compute the recall (% of neighbors found) given the neighbors returned by LSHSearch::Search and a "ground truth" set of neighbors. More... | |
Private Types | |
typedef std::pair< double, size_t > | Candidate |
Candidate represents a possible candidate neighbor (distance, index). More... | |
typedef std::priority_queue< Candidate, std::vector< Candidate >, CandidateCmp > | CandidateList |
Use a priority queue to represent the list of candidate neighbors. More... | |
Private Member Functions | |
void | BaseCase (const size_t queryIndex, const arma::uvec &referenceIndices, const size_t k, arma::Mat< size_t > &neighbors, arma::mat &distances) const |
This is a helper function that computes the distance of the query to the neighbor candidates and appropriately stores the best 'k' candidates. More... | |
void | BaseCase (const size_t queryIndex, const arma::uvec &referenceIndices, const size_t k, const arma::mat &querySet, arma::Mat< size_t > &neighbors, arma::mat &distances) const |
This is a helper function that computes the distance of the query to the neighbor candidates and appropriately stores the best 'k' candidates. More... | |
void | GetAdditionalProbingBins (const arma::vec &queryCode, const arma::vec &queryCodeNotFloored, const size_t T, arma::mat &additionalProbingBins) const |
This function implements the core idea behind Multiprobe LSH. More... | |
bool | PerturbationExpand (std::vector< bool > &A) const |
Inline function used by GetAdditionalProbingBins. More... | |
double | PerturbationScore (const std::vector< bool > &A, const arma::vec &scores) const |
Returns the score of a perturbation vector generated by perturbation set A. More... | |
bool | PerturbationShift (std::vector< bool > &A) const |
Inline function used by GetAdditionalProbingBins. More... | |
bool | PerturbationValid (const std::vector< bool > &A) const |
Return true if perturbation set A is valid. More... | |
template<typename VecType > | |
void | ReturnIndicesFromTable (const VecType &queryPoint, arma::uvec &referenceIndices, size_t numTablesToSearch, const size_t T) const |
This function takes a query and hashes it into each of the hash tables to get keys for the query and then the key is hashed to a bucket of the second hash table and all the points (if any) in those buckets are collected as the potential neighbor candidates. More... | |
Private Attributes | |
arma::Col< size_t > | bucketContentSize |
The number of elements present in each hash bucket; should be secondHashSize. More... | |
arma::Col< size_t > | bucketRowInHashTable |
For a particular hash value, points to the row in secondHashTable corresponding to this value. More... | |
size_t | bucketSize |
The bucket size of the second hash. More... | |
size_t | distanceEvaluations |
The number of distance evaluations. More... | |
double | hashWidth |
The hash width. More... | |
size_t | numProj |
The number of projections. More... | |
size_t | numTables |
The number of hash tables. More... | |
arma::mat | offsets |
The list of the offsets 'b' for each of the projection for each table. More... | |
bool | ownsSet |
If true, we own the reference set. More... | |
arma::cube | projections |
The arma::cube containing the projection matrix of each table. More... | |
const arma::mat * | referenceSet |
Reference dataset. More... | |
size_t | secondHashSize |
The big prime representing the size of the second hash. More... | |
std::vector< arma::Col< size_t > > | secondHashTable |
The final hash table; should be (< secondHashSize) vectors each with (<= bucketSize) elements. More... | |
arma::vec | secondHashWeights |
The weights of the second hash. More... | |
The LSHSearch class; this class builds a hash on the reference set and uses this hash to compute the distance-approximate nearest-neighbors of the given queries.
SortPolicy | The sort policy for distances; see NearestNeighborSort. |
Definition at line 62 of file lsh_search.hpp.
|
private |
Candidate represents a possible candidate neighbor (distance, index).
Definition at line 460 of file lsh_search.hpp.
|
private |
Use a priority queue to represent the list of candidate neighbors.
Definition at line 472 of file lsh_search.hpp.
mlpack::neighbor::LSHSearch< SortPolicy >::LSHSearch | ( | const arma::mat & | referenceSet, |
const arma::cube & | projections, | ||
const double | hashWidth = 0.0 , |
||
const size_t | secondHashSize = 99901 , |
||
const size_t | bucketSize = 500 |
||
) |
This function initializes the LSH class.
It builds the hash on the reference set with 2-stable distributions. See the individual functions performing the hashing for details on how the hashing is done.
referenceSet | Set of reference points and the set of queries. |
projections | Cube of projection tables. For a cube of size (a, b, c) we set numProj = a, numTables = c. b is the reference set dimensionality. |
hashWidth | The width of hash for every table. If 0 (the default) is provided, then the hash width is automatically obtained by computing the average pairwise distance of 25 pairs. This should be a reasonable upper bound on the nearest-neighbor distance in general. |
secondHashSize | The size of the second hash table. This should be a large prime number. |
bucketSize | The size of the bucket in the second hash table. This is the maximum number of points that can be hashed into single bucket. A value of 0 indicates that there is no limit (so the second hash table can be arbitrarily large—be careful!). |
mlpack::neighbor::LSHSearch< SortPolicy >::LSHSearch | ( | const arma::mat & | referenceSet, |
const size_t | numProj, | ||
const size_t | numTables, | ||
const double | hashWidth = 0.0 , |
||
const size_t | secondHashSize = 99901 , |
||
const size_t | bucketSize = 500 |
||
) |
This function initializes the LSH class.
It builds the hash one the reference set using the provided projections. See the individual functions performing the hashing for details on how the hashing is done.
referenceSet | Set of reference points and the set of queries. |
numProj | Number of projections in each hash table (anything between 10-50 might be a decent choice). |
numTables | Total number of hash tables (anything between 10-20 should suffice). |
hashWidth | The width of hash for every table. If 0 (the default) is provided, then the hash width is automatically obtained by computing the average pairwise distance of 25 pairs. This should be a reasonable upper bound on the nearest-neighbor distance in general. |
secondHashSize | The size of the second hash table. This should be a large prime number. |
bucketSize | The size of the bucket in the second hash table. This is the maximum number of points that can be hashed into single bucket. A value of 0 indicates that there is no limit (so the second hash table can be arbitrarily large—be careful!). |
mlpack::neighbor::LSHSearch< SortPolicy >::LSHSearch | ( | ) |
mlpack::neighbor::LSHSearch< SortPolicy >::LSHSearch | ( | const LSHSearch< SortPolicy > & | other | ) |
Copy the given LSH model.
other | Other LSH model to copy. |
mlpack::neighbor::LSHSearch< SortPolicy >::LSHSearch | ( | LSHSearch< SortPolicy > && | other | ) |
Take ownership of the given LSH model.
other | Other LSH model to take ownership of. |
mlpack::neighbor::LSHSearch< SortPolicy >::~LSHSearch | ( | ) |
Clean memory.
|
private |
This is a helper function that computes the distance of the query to the neighbor candidates and appropriately stores the best 'k' candidates.
This is specific to the monochromatic search case, where the query set is the reference set.
queryIndex | The index of the query in question |
referenceIndices | The vector of indices of candidate neighbors for the query. |
k | Number of neighbors to search for. |
neighbors | Matrix holding output neighbors. |
distances | Matrix holding output distances. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
This is a helper function that computes the distance of the query to the neighbor candidates and appropriately stores the best 'k' candidates.
This is specific to bichromatic search, where the query set is not the same as the reference set.
queryIndex | The index of the query in question |
referenceIndices | The vector of indices of candidate neighbors for the query. |
k | Number of neighbors to search for. |
querySet | Set of query points. |
neighbors | Matrix holding output neighbors. |
distances | Matrix holding output distances. |
|
inline |
Get the bucket size of the second hash.
Definition at line 280 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::bucketSize.
|
static |
Compute the recall (% of neighbors found) given the neighbors returned by LSHSearch::Search and a "ground truth" set of neighbors.
The recall returned will be in the range [0, 1].
foundNeighbors | Set of neighbors to compute recall of. |
realNeighbors | Set of "ground truth" neighbors to compute recall against. |
|
inline |
Return the number of distance evaluations performed.
Definition at line 263 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::distanceEvaluations.
|
inline |
Modify the number of distance evaluations performed.
Definition at line 265 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::distanceEvaluations.
|
private |
This function implements the core idea behind Multiprobe LSH.
It is called by ReturnIndicesFromTables when T > 0. Given a query's code and its projection location, GetAdditionalProbingBins will calculate the T most likely alternative bin codes (other than queryCode) where a query's neighbors might be found in.
queryCode | vector containing the numProj-dimensional query code. |
queryCodeNotFloored | vector containing the projection location of the query. |
T | number of additional probing bins. |
additionalProbingBins | matrix. Each column will hold one additional bin. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
inline |
Get the number of projections.
Definition at line 271 of file lsh_search.hpp.
|
inline |
Get the offsets 'b' for each of the projections. (One 'b' per column.)
Definition at line 274 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::offsets.
LSHSearch& mlpack::neighbor::LSHSearch< SortPolicy >::operator= | ( | const LSHSearch< SortPolicy > & | other | ) |
Copy the given LSH model.
other | Other LSH model to copy. |
LSHSearch& mlpack::neighbor::LSHSearch< SortPolicy >::operator= | ( | LSHSearch< SortPolicy > && | other | ) |
Take ownership of the given LSH model.
other | Other LSH model to take ownership of. |
|
private |
Inline function used by GetAdditionalProbingBins.
The vector expansion operation adds the element [1 + (largest_element)] to a vector A, where largest_element is the largest element of A. Returns true if resulting vector is valid, otherwise false.
A | perturbation set to expand. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
Returns the score of a perturbation vector generated by perturbation set A.
The score of a pertubation set (vector) is the sum of scores of the participating actions.
A | perturbation set to compute the score of. |
scores | vector containing score of each perturbation. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
Inline function used by GetAdditionalProbingBins.
The vector shift operation replaces the largest element of a vector A with (largest element)
A | perturbation set to shift. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
Return true if perturbation set A is valid.
A perturbation set is invalid if it contains two (or more) actions for the same dimension or dimensions that are larger than the queryCode's dimensions.
A | perturbation set to validate. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
inline |
Get the projection tables.
Definition at line 287 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::projections.
|
inline |
Change the projection tables (this retrains the LSH model).
Definition at line 290 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::BaseCase(), mlpack::neighbor::LSHSearch< SortPolicy >::bucketSize, mlpack::neighbor::LSHSearch< SortPolicy >::GetAdditionalProbingBins(), mlpack::neighbor::LSHSearch< SortPolicy >::hashWidth, mlpack::neighbor::LSHSearch< SortPolicy >::PerturbationExpand(), mlpack::neighbor::LSHSearch< SortPolicy >::PerturbationScore(), mlpack::neighbor::LSHSearch< SortPolicy >::PerturbationShift(), mlpack::neighbor::LSHSearch< SortPolicy >::PerturbationValid(), mlpack::neighbor::LSHSearch< SortPolicy >::ReturnIndicesFromTable(), mlpack::neighbor::LSHSearch< SortPolicy >::secondHashSize, and mlpack::neighbor::LSHSearch< SortPolicy >::Train().
|
inline |
Return the reference dataset.
Definition at line 268 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::referenceSet.
|
private |
This function takes a query and hashes it into each of the hash tables to get keys for the query and then the key is hashed to a bucket of the second hash table and all the points (if any) in those buckets are collected as the potential neighbor candidates.
queryPoint | The query point currently being processed. |
referenceIndices | The list of neighbor candidates obtained from hashing the query into all the hash tables and eventually into multiple buckets of the second hash table. |
numTablesToSearch | The number of tables to perform the search in. If 0, all tables are searched. |
T | The number of additional probing bins for multiprobe LSH. If 0, single-probe is used. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
void mlpack::neighbor::LSHSearch< SortPolicy >::Search | ( | const arma::mat & | querySet, |
const size_t | k, | ||
arma::Mat< size_t > & | resultingNeighbors, | ||
arma::mat & | distances, | ||
const size_t | numTablesToSearch = 0 , |
||
const size_t | T = 0 |
||
) |
Compute the nearest neighbors of the points in the given query set and store the output in the given matrices.
The matrices will be set to the size of n columns by k rows, where n is the number of points in the query dataset and k is the number of neighbors being searched for.
querySet | Set of query points. |
k | Number of neighbors to search for. |
resultingNeighbors | Matrix storing lists of neighbors for each query point. |
distances | Matrix storing distances of neighbors for each query point. |
numTablesToSearch | This parameter allows the user to have control over the number of hash tables to be searched. This allows the user to pick the number of tables it can afford for the time available without having to build hashing for every table size. By default, this is set to zero in which case all tables are considered. |
T | The number of additional probing bins to examine with multiprobe LSH. If T = 0, classic single-probe LSH is run (default). |
void mlpack::neighbor::LSHSearch< SortPolicy >::Search | ( | const size_t | k, |
arma::Mat< size_t > & | resultingNeighbors, | ||
arma::mat & | distances, | ||
const size_t | numTablesToSearch = 0 , |
||
size_t | T = 0 |
||
) |
Compute the nearest neighbors and store the output in the given matrices.
The matrices will be set to the size of n columns by k rows, where n is the number of points in the query dataset and k is the number of neighbors being searched for.
k | Number of neighbors to search for. |
resultingNeighbors | Matrix storing lists of neighbors for each query point. |
distances | Matrix storing distances of neighbors for each query point. |
numTablesToSearch | This parameter allows the user to have control over the number of hash tables to be searched. This allows the user to pick the number of tables it can afford for the time available without having to build hashing for every table size. By default, this is set to zero in which case all tables are considered. |
|
inline |
Get the second hash table.
Definition at line 283 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::secondHashTable.
|
inline |
Get the weights of the second hash.
Definition at line 277 of file lsh_search.hpp.
References mlpack::neighbor::LSHSearch< SortPolicy >::secondHashWeights.
void mlpack::neighbor::LSHSearch< SortPolicy >::Serialize | ( | Archive & | ar, |
const unsigned int | version | ||
) |
Serialize the LSH model.
ar | Archive to serialize to. |
void mlpack::neighbor::LSHSearch< SortPolicy >::Train | ( | const arma::mat & | referenceSet, |
const size_t | numProj, | ||
const size_t | numTables, | ||
const double | hashWidth = 0.0 , |
||
const size_t | secondHashSize = 99901 , |
||
const size_t | bucketSize = 500 , |
||
const arma::cube & | projection = arma::cube() |
||
) |
Train the LSH model on the given dataset.
If a correctly-sized projection cube is not provided, this means building new hash tables. Otherwise, we use the projections provided by the user.
referenceSet | Set of reference points and the set of queries. |
numProj | Number of projections in each hash table (anything between 10-50 might be a decent choice). |
numTables | Total number of hash tables (anything between 10-20 should suffice). |
hashWidth | The width of hash for every table. If 0 (the default) is provided, then the hash width is automatically obtained by computing the average pairwise distance of 25 pairs. This should be a reasonable upper bound on the nearest-neighbor distance in general. |
secondHashSize | The size of the second hash table. This should be a large prime number. |
bucketSize | The size of the bucket in the second hash table. This is the maximum number of points that can be hashed into single bucket. A value of 0 indicates that there is no limit (so the second hash table can be arbitrarily large—be careful!). |
projections | Cube of projection tables. For a cube of size (a, b, c) we set numProj = a, numTables = c. b is the reference set dimensionality. |
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
The number of elements present in each hash bucket; should be secondHashSize.
Definition at line 450 of file lsh_search.hpp.
|
private |
For a particular hash value, points to the row in secondHashTable corresponding to this value.
Length secondHashSize.
Definition at line 454 of file lsh_search.hpp.
|
private |
The bucket size of the second hash.
Definition at line 442 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::BucketSize(), and mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
The number of distance evaluations.
Definition at line 457 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::DistanceEvaluations().
|
private |
The hash width.
Definition at line 433 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
The number of projections.
Definition at line 422 of file lsh_search.hpp.
|
private |
The number of hash tables.
Definition at line 424 of file lsh_search.hpp.
|
private |
The list of the offsets 'b' for each of the projection for each table.
Definition at line 430 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Offsets().
|
private |
If true, we own the reference set.
Definition at line 419 of file lsh_search.hpp.
|
private |
The arma::cube containing the projection matrix of each table.
Definition at line 427 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
Reference dataset.
Definition at line 417 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::ReferenceSet().
|
private |
The big prime representing the size of the second hash.
Definition at line 436 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::Projections().
|
private |
The final hash table; should be (< secondHashSize) vectors each with (<= bucketSize) elements.
Definition at line 446 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::SecondHashTable().
|
private |
The weights of the second hash.
Definition at line 439 of file lsh_search.hpp.
Referenced by mlpack::neighbor::LSHSearch< SortPolicy >::SecondHashWeights().