Initiated 0.1.5 release & doc cleanup

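Note on the multithreading instructions added to `docs/src/index.md`: the snippet below is a minimal sketch, not part of the committed docs. It assumes a POSIX shell where `getconf _NPROCESSORS_ONLN` is available (Linux and macOS), and sets `JULIA_NUM_THREADS` to the number of online cores instead of a hand-picked `n`. The final check is left commented out because it assumes `julia` is on the `PATH`.

```bash
# Count the online CPU cores (assumes getconf supports _NPROCESSORS_ONLN).
n="$(getconf _NPROCESSORS_ONLN)"

# Export the thread count so a Julia session started from this shell inherits it.
export JULIA_NUM_THREADS="$n"
echo "JULIA_NUM_THREADS=$JULIA_NUM_THREADS"

# Optional check, assuming `julia` is on the PATH:
# julia -e 'println(Threads.nthreads())'
```

The variable only affects Julia processes launched from the same shell session after the export.
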
PyDataBlog · PyDataBlog · commit ad9fefdafdea · 2020-04-17T13:48:42.000+02:00
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "ParallelKMeans"
 uuid = "42b8e9d4-006b-409a-8472-7f34b3fb58af"
 authors = ["Bernard Brenyah", "Andrey Oskin"]
-version = "0.1.4"
+version = "0.1.5"
 
 [deps]
 Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
@@ -10,7 +10,7 @@ StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
 
 [compat]
 StatsBase = "0.32, 0.33"
-julia = "1.3, 1.4"
+julia = "1.3"
 Distances = "0.8.2"
 MLJModelInterface = "0.2.1"
 
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -3,7 +3,7 @@
 ## Motivation
 
 It's actually a funny story led to the development of this package.
-What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
+What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package.
 
 Say hello to `ParallelKMeans`!
 
@@ -24,6 +24,22 @@ As a result, it is useful in practice to restart it several times to get the cor
 
 ## Installation
 
+If you are using Julia in the recommended [Juno IDE](https://junolab.org/), the number of threads is already set to the number of available CPU cores, so multithreading is enabled out of the box.
+For other IDEs, the number of threads must be exported as an environment variable before launching the Julia REPL from the command line.
+
+*TIP*: To launch Julia from the command line, navigate to the Julia executable or point your shell to it (for example, by adding its location to your `PATH`).
+Enable multithreading on Mac/Linux systems via:
+
+```bash
+export JULIA_NUM_THREADS=n  # where n is the number of threads/cores
+```
+
+For Windows systems (Command Prompt), where n is the number of threads/cores:
+
+```bat
+set JULIA_NUM_THREADS=n
+```
+
 You can grab the latest stable version of this package from Julia registries by simply running;
 
 *NB:* Don't forget to Julia's package manager with `]`
@@ -58,6 +74,7 @@ git checkout experimental
 - [X] Full Implementation of Triangle inequality based on [Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
 - [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
 - [ ] Support for other distance metrics supported by [Distances.jl](https://github.com/JuliaStats/Distances.jl#supported-distances).
+- [ ] Implementation of [Yinyang K-Means](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf).
 - [ ] Native support for tabular data inputs outside of MLJModels' interface.
 - [ ] Refactoring and finalizaiton of API desgin.
 - [ ] GPU support.
@@ -98,13 +115,14 @@ r.iterations            # number of elapsed iterations
 r.converged             # whether the procedure converged
 ```
 
-### Supported KMeans algorithm variations
+### Supported KMeans algorithm variations and recommended use cases
 
-- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
-- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster)
-- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
+- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf) - Default algorithm, but only recommended for very small matrices (switch to `n_threads = 1` to avoid overhead).
+- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster) - Useful in most cases. If uncertain about your use case, use this!
+- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - Recommended for high-dimensional data.
 - [Geometric()](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf) - (Coming soon)
 - [MiniBatch()](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) - (Coming soon)
+- [Yinyang](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf) - (Coming soon)
 
 ### Practical Usage Examples
 
@@ -176,6 +194,7 @@ ________________________________________________________________________________
 - 0.1.1 Added interface for MLJ.
 - 0.1.2 Added Elkan algorithm.
 - 0.1.3 Faster & optimized execution.
+- 0.1.4 Updated interface for MLJ with a predict function.
 
 ## Contributing
 
diff --git a/src/mlj_interface.jl b/src/mlj_interface.jl
@@ -154,6 +154,7 @@ function MMI.predict(m::KMeans, fitresult, Xnew)
     locations, cluster_labels, _ = fitresult
 
     Xarray = MMI.matrix(Xnew)
+    # TODO: Switch to a non-allocating method.
     (n, p), k = size(Xarray), m.k
 
     pred = zeros(Int, n)