Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterate Cluster Tree #102

Open
mh1git opened this issue Aug 1, 2023 · 5 comments
Open

Iterate Cluster Tree #102

mh1git opened this issue Aug 1, 2023 · 5 comments

Comments

@mh1git
Copy link

mh1git commented Aug 1, 2023

Hello,

I created a KMeans (k = 2) recursive ClusterTree with the code snippet below. I would like to verify the 3D points each cluster contains. Is this possible?

std::size_t cluster_size = 500;
srand (time(NULL));
int nP = 20000;
DenseMatrix<double> p(3,nP);

for (int iP = 0; iP < nP; iP++)
{
  p(0,iP) = rand() / (double)RAND_MAX;
  p(1,iP) = rand() / (double)RAND_MAX;
  p(2,iP) = 0.0;
  iP++;
}

std::vector<int> perm;
perm.resize(p.cols());
std::iota(perm.begin(), perm.end(), 1);
//std::mt19937 mt(time(nullptr));
//structured::ClusterTree tree = recursive_2_means(p, cluster_size, perm.data(), mt);
structured::ClusterTree tree = recursive_cobble(p, cluster_size, perm.data());

// Diagnostics

int levels = tree.levels();
cout << "levels : " << levels << endl;
int nodes = tree.nodes();
cout << "nodes : " << nodes << endl;
std::vector<size_t> leaf_sizes = tree.leaf_sizes<size_t>();
cout << endl;
cout << "leaf_sizes" << endl;
for (int i = 0; i < leaf_sizes.size(); i++)
  cout << i << " : " << leaf_sizes[i] << endl;
cout << endl;
cout << "tree print" << endl;
tree.print();
cout << endl;
cout << "perm" << endl;
for (int i = 0; i < perm.size(); i++)
  cout << i << " : " << perm[i] << endl;
@pghysels
Copy link
Owner

pghysels commented Aug 3, 2023

I think that information is contained in the perm vector.
The leafs of the ClusterTree tree contain the sizes of the clusters, you can get that with auto ls = tree.leaf_sizes() .
Then perm[0] to perm[ls[0]] will point to the points in the first cluster.

@mh1git
Copy link
Author

mh1git commented Aug 4, 2023

Thank you for your response. It does help. However, I am not completely following.

Please consider the following example. I amended the original post to print out some of the cluster tree data to aid in the discussion.

I ran the code w/ these settings ...

std::size_t cluster_size = 4;
int nP = 20;

... which produced this output (noticed I switched to cobble stone - actual tree scheme is irrelevant for this discussion) ...

levels : 4
nodes : 15

leaf_sizes
0 : 2
1 : 3
2 : 2
3 : 3
4 : 2
5 : 3
6 : 2
7 : 3

tree print
2 3 5 2 3 5 10 2 3 5 2 3 5 10 20 

perm
0 : 7
1 : 13
2 : 11
3 : 9
4 : 15
5 : 5
6 : 19
7 : 3
8 : 17
9 : 1
10 : 18
11 : 20
12 : 16
13 : 10
14 : 2
15 : 4
16 : 8
17 : 6
18 : 14
19 : 12

My understanding is this cluster tree has 15 hierarchical clusters based on the "nodes" and the "tree print" output. The 15 hierarchical cluster tree is shown below with the number of points of each cluster.

        2
      5
        3
   10
        2
      5
        3
20
        2
      5
        3
   10
        2
      5
        3

When you say "first" cluster I am not sure how the algorithm orders the cardinality of the clusters. I would call the "first" cluster the intitial cluster which holds all 20 points. Then the "next" clusters would be two 10 point clusters. Then the "next" cluster would be the four 5 point clusters. Then final level = 4 clusters are four leaf pairs containing 2 and 3 points each. Is your description only going to show point indices in the leaf clusters? If so, I need to know the point indices in all the clusters - not just the leaf clusters.

My question is then what are the points (indices) for each of the 15 hiearachical clusters in the cluster tree? I think the demonstration above reflects the number of points of each cluster but I need the actual point indices for all clusters in the cluster tree. Maybe you can tell me how to get them all for this simple example.

@mh1git
Copy link
Author

mh1git commented Aug 4, 2023

Oh I think I see one aspect of the issue. Once we know the points in the leaf clusters we can then accumulate backwards to get the points in parent clusters as needed.

But I still don't follow getting the leaf point indices (assuming this was your initial intent). In this example you say the "first" cluster corresponds to ...

perm[0] to perm[ls[0]]

... which is ...

perm[0] to perm[2]

... or ...

7 to 11

... which is 5 indices. But all the leaf sizes are either 2 or 3 according to demonstration above. What am I missing? It would also be helpful (in general) if the structure of the permutation vector was documented. Is it?

@pghysels
Copy link
Owner

pghysels commented Aug 4, 2023

I believe the points are assigned like this:

        2:  7, 13
      5
        3:  11, 9, 15
   10
        2:  5, 19
      5
        3:  3, 17, 1
20
        2:  18, 20
      5
        3:  16, 10, 2
   10
        2:  4, 8
      5
        3:  6, 14, 12

note that the perm values are 1 based. (then permutations can be applied with LAPACK routines like ?lapmt/?lapmr)
Thee tree is typically traversed left to right, bottom to top.
I will check the documentation.

@mh1git
Copy link
Author

mh1git commented Aug 4, 2023

Ah very good. I will examine this further in my actual context usage. I consider the issue closed for now and thanks again for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants