Abstract (eng)
Dimensionality reduction is the transformation of high-dimensional
data into a lower-dimensional space while keeping the interesting parts
of the data intact. Ideally, the resulting low-dimensional embedding
reveals structure that would have been much more difficult to detect
in the high-dimensional space.
There are many different approaches to dimensionality reduction, and the most
common taxonomy separates them into linear and non-linear methods. Linear
methods, such as PCA, transform the high-dimensional data solely by applying
linear transformations, while non-linear methods are essentially free to
transform the data by any means.
In this thesis, we focus on a non-linear method called t-SNE,
a variant of Stochastic Neighbor Embedding that uses the Student-t
distribution. t-SNE has become popular due to the visual properties of the
embeddings it produces: well-separated clusters consisting of similar items.
It is, however, a stochastic method that is non-parametric in its original form,
has a non-convex objective function that is difficult to optimize, and suffers
from a prohibitively high runtime complexity of O(n^2).
We implement and evaluate a variant of t-SNE named kt-SNE that utilizes k-means
clustering and locality-sensitive hashing, achieving an overall runtime complexity of O(n).
We thoroughly evaluate our method against three competing methods. On real
data, it achieves results that are mostly within 10% of the best competing
result, and it runs significantly faster than Barnes-Hut t-SNE.