PQk-means is a Python library for efficient clustering of large-scale data. Standard k-means clustering is slow and inefficient on large-scale data, whereas PQk-means is an efficient clustering method for billion-scale feature vectors.

PQk-means achieves its speed and efficiency by first compressing input vectors into short product-quantized (PQ) codes.
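To illustrate what compressing vectors into PQ codes means, here is a minimal NumPy-only sketch of product quantization. It is not the library's implementation: the function names (train_codebooks, encode, decode), the plain Lloyd iterations, and the toy sizes are assumptions made for illustration. Each vector is split into num_subdim sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a per-subspace codebook of Ks centroids.

```python
import numpy as np

def train_codebooks(X, num_subdim, Ks, n_iter=10, seed=0):
    """Learn one codebook of Ks centroids per sub-vector (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assert d % num_subdim == 0, "dimension must be divisible by num_subdim"
    assert Ks <= 256, "this sketch stores each index in a uint8"
    ds = d // num_subdim
    codebooks = np.empty((num_subdim, Ks, ds), dtype=X.dtype)
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        # Initialise centroids from Ks distinct training samples.
        centers = sub[rng.choice(n, size=Ks, replace=False)].copy()
        for _ in range(n_iter):
            # Squared Euclidean distance from every sub-vector to every centroid.
            dists = ((sub[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for k in range(Ks):
                members = sub[labels == k]
                if len(members) > 0:  # keep the old centroid if a cell is empty
                    centers[k] = members.mean(axis=0)
        codebooks[m] = centers
    return codebooks

def encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    num_subdim, Ks, ds = codebooks.shape
    codes = np.empty((X.shape[0], num_subdim), dtype=np.uint8)
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dists.argmin(axis=1)
    return codes

def decode(codes, codebooks):
    """Approximately reconstruct vectors by concatenating the chosen centroids."""
    return np.hstack([codebooks[m][codes[:, m]] for m in range(codebooks.shape[0])])

# Toy data: 500 vectors of dimension 16, 4 sub-vectors, 16 centroids each.
X_demo = np.random.default_rng(1).random((500, 16))
codebooks = train_codebooks(X_demo, num_subdim=4, Ks=16)
codes = encode(X_demo, codebooks)
X_hat = decode(codes, codebooks)
print(codes.shape, codes.dtype)  # one small integer per sub-vector
```

Clustering can then operate on the short integer codes instead of the original floating-point vectors, which is where the memory savings come from.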

Requisites

CMake

brew install cmake (for OS X)

sudo apt install cmake (for Ubuntu)

OpenMP (Optional)

If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation.

Build & install

You can install the library from PyPI:

pip install pqkmeans

Alternatively, to use the current master version, build and install the library manually:

git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install

Run samples

# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100

# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100

Test

python setup.py test

Usage

For PQk-means

import pqkmeans
import numpy as np

X = np.random.random((100000, 128))  # 100,000 samples, 128 dimensions

# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8 values.
# You can train the encoder and transform the input vectors to PQ codes in advance.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the id of the assigned center for the first input PQ code (X_pqcode[0]).
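The code-size arithmetic in the usage example above can be checked with a few lines of Python. This is a rough sketch: pq_code_bits is a hypothetical helper rather than part of pqkmeans, and it counts idealised bits per code, assuming the raw vectors are float64 (the NumPy default for np.random.random).

```python
import math
import numpy as np

def pq_code_bits(num_subdim, Ks):
    """Idealised bits per PQ code: one ceil(log2(Ks))-bit index per sub-vector."""
    return num_subdim * math.ceil(math.log2(Ks))

n_samples, dim = 100_000, 128
bits = pq_code_bits(num_subdim=4, Ks=256)

# Raw storage: every component is a float64 (8 bytes).
raw_bytes = n_samples * dim * np.dtype(np.float64).itemsize
# PQ storage: 'bits' bits per vector.
code_bytes = n_samples * bits // 8

print(f"bits per code: {bits}")
print(f"raw vectors:   {raw_bytes / 1e6:.1f} MB")
print(f"PQ codes:      {code_bytes / 1e6:.1f} MB")
print(f"compression:   {raw_bytes // code_bytes}x")
```

Under these assumptions, the 100,000 vectors shrink from about 102.4 MB to 0.4 MB, a 256-fold reduction. The codebooks themselves add a small fixed overhead that this estimate ignores.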

More details are available at the links below.

GitHub: https://github.com/DwangoMediaVillage/pqkmeans

Paper: https://arxiv.org/pdf/1709.03708.pdf

Project: http://yusukematsui.me/project/pqkmeans/pqkmeans.html

Tutorial: https://github.com/DwangoMediaVillage/pqkmeans/tree/master/tutorial