19/09
19:00 to 22:00
Aula Magna
What is big data? For this talk, "big" refers to the number of samples (n) and/or the number of dimensions (p) in static sets of feature vector data, or to the size of the (similarity or distance) matrices used for relational clustering. The objectives of clustering in static sets of big numerical data are acceleration for loadable data and feasibility for non-loadable data. Three ways currently in favor to achieve these objectives are: (i) streaming (online) clustering, which sidesteps growth in (n) entirely; (ii) chunking and distributed processing; and (iii) sampling followed by a very fast (usually 1-2% of the overall processing time) non-iterative extension to the remainder of the data. Kernel-based methods are mentioned, but not covered in this talk.
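As a rough illustration of approach (i), the sketch below shows one common single-pass variant of hard c-means (a MacQueen-style running-mean update). The function name, initialization, and synthetic stream are assumptions for illustration, not material from the talk.

```python
import numpy as np

def streaming_hcm(stream, c):
    """Single-pass hard c-means: each point updates its nearest centroid and is
    then discarded, so memory stays O(c) no matter how large n grows."""
    it = iter(stream)
    # Seed the centroids with the first c points (an arbitrary choice for this sketch).
    centroids = np.array([next(it) for _ in range(c)], dtype=float)
    counts = np.ones(c)
    for x in it:
        k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # nearest centroid
        counts[k] += 1
        centroids[k] += (x - centroids[k]) / counts[k]             # running-mean update
        # x is discarded here; the full data set never needs to be loadable
    return centroids

# Usage: 100,000 2-D points arrive one at a time and are never stored.
rng = np.random.default_rng(1)
points = (rng.standard_normal(2) + 6 * rng.integers(0, 3) for _ in range(100_000))
print(streaming_hcm(points, c=3))
```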
Four classical clustering methods have withstood the test of time. I call them the "Fantastic Four":
Gaussian Mixture Decomposition (GMD, 1898)
Hard c-means (often called "k-means", HCM, 1956)
Fuzzy c-means (reduces to HCM in the limit, FCM, 1973)
SAHN clustering, principally single linkage (SL, 1909)

This talk describes how sampling followed by non-iterative extension carries each of the Fantastic Four to the big data case. Three methods of sampling are covered: random, progressive, and minimax. The last portion of the talk summarizes a few of the many acceleration methods for each of the Fantastic Four.
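To make "sampling followed by non-iterative extension" concrete, here is a minimal sketch using hard c-means: the iterative clustering runs only on a small random sample, and the extension is a single nearest-centroid pass over the full data. The ~1% sample rate, function names, and synthetic data are assumptions for illustration, not details from the talk.

```python
import numpy as np

def hcm(X, c, iters=100, seed=0):
    """Plain Lloyd-style hard c-means; the iterative work happens only here,
    on the small sample."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), c, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
        # Recompute means; keep the old centroid if a cluster happens to empty.
        new = np.array([X[labels == k].mean(0) if np.any(labels == k) else centroids[k]
                        for k in range(c)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def extend(X, centroids):
    """Non-iterative extension: one nearest-centroid pass over all of X.
    This is the cheap step, typically a small fraction of total time."""
    return np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)

# Usage: iterate only on a ~1% random sample, then label everything in one pass.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50_000, 2)) + m for m in (0, 6, 12)])
sample = X[rng.choice(len(X), len(X) // 100, replace=False)]
labels = extend(X, hcm(sample, c=3))
```

In the fuzzy and probabilistic cases, the extension step instead computes memberships (FCM) or posterior probabilities (GMD) non-iteratively from the sample-derived prototypes.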
For more information, contact:
Leticia Seijas: lseijas@fi.mdp.edu.ar
Daniela López De Luise: daniela_ldl@ieee.org