This directory contains a sample program that explains the algorithm for
computing cdf. The steps are run in order:

  s1-generate-suffix < CORPUS > CORPUS.s1
  sh s2-sort-suffix < CORPUS.s1 > CORPUS.s2
  s3-compute-neighbor < CORPUS.s2 > CORPUS.s3
  s4-compute-lcp < CORPUS.s3 > CORPUS.s4
  s5-detect-class < CORPUS.s4 > RESULT

or

  sh compute-table.sh < CORPUS > CORPUS.s4
  sh detect-class.sh < CORPUS > RESULT

---
For more practical use, a single C source file combines all of the steps
and includes code for accessing the table. It handles Unicode (UTF-8).

  dfk.h : Header file
  dfk.c : Library

dfk.c provides functions that return the corpus frequency of a string,
the document frequency of a string, and the number of documents that
contain the string more than once.

For updates, visit
http://www.ss.cs.tut.ac.jp/umemura/cicling2009/
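
As an illustration of the three statistics that dfk.c is described as
providing, the sketch below computes them by brute force over an in-memory
array of documents. It is not the code in dfk.c and does not use the
suffix table: the names substring_stats, count_occurrences, and
naive_stats are hypothetical, matching is byte-wise, and overlapping
occurrences are assumed to count separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; not taken from dfk.h. */
struct substring_stats {
    size_t cf;   /* corpus frequency: occurrences summed over all documents */
    size_t df;   /* document frequency: documents with at least one occurrence */
    size_t df2;  /* documents that contain the string more than once */
};

/* Count occurrences of pat in text, including overlapping ones
 * (assumption: "aa" occurs twice in "aaa"). pat must be non-empty. */
size_t count_occurrences(const char *text, const char *pat)
{
    size_t n = 0;
    const char *p = text;

    assert(pat != NULL && pat[0] != '\0');
    while ((p = strstr(p, pat)) != NULL) {
        n++;
        p++;  /* step one byte forward so overlapping matches are found */
    }
    return n;
}

/* Brute-force scan of every document; cost grows with the corpus size
 * on every query, which is what a precomputed table avoids. */
struct substring_stats naive_stats(const char *const docs[], size_t ndocs,
                                   const char *pat)
{
    struct substring_stats s = {0, 0, 0};

    for (size_t i = 0; i < ndocs; i++) {
        size_t c = count_occurrences(docs[i], pat);
        s.cf += c;
        if (c >= 1)
            s.df++;
        if (c >= 2)
            s.df2++;
    }
    return s;
}
```

For the documents "abracadabra", "banana", and "cab", the string "ab"
gives cf = 3, df = 2, and df2 = 1 under these assumptions.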