This directory contains a sample program that explains the algorithm
for computing cdf. Run the steps in order:

s1-generate-suffix       < CORPUS > CORPUS.s1
sh s2-sort-suffix        < CORPUS.s1 > CORPUS.s2
s3-compute-neighbor      < CORPUS.s2 > CORPUS.s3
s4-compute-lcp           < CORPUS.s3 > CORPUS.s4
s5-detect-class          < CORPUS.s4 > RESULT

or

sh compute-table.sh < CORPUS > CORPUS.s4
sh detect-class.sh  < CORPUS > RESULT
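
The suffix-based steps can be illustrated in memory with a short C sketch.
It is not the code of the s1 to s5 programs, whose file formats are not
described here, and it covers only suffix generation, sorting, and the
common-prefix computation; the neighbor and class steps are left out. All
function names are hypothetical. The idea it shows is that, once suffixes
are sorted, every occurrence of a string lies in one contiguous run of
suffixes, and the longest common prefix (LCP) of adjacent suffixes marks
where such runs begin and end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: order suffix pointers lexicographically (bytewise). */
int compare_suffix(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Length in bytes of the longest common prefix of two strings. */
size_t common_prefix_length(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}

/*
 * Hypothetical helper: fill suffixes[0..n-1] with pointers to the sorted
 * suffixes of text (n = strlen(text)), and set lcp[i] to the common prefix
 * length of suffixes[i-1] and suffixes[i], with lcp[0] = 0.
 * The caller provides both arrays with room for n entries.
 */
void build_sorted_suffixes(const char *text, const char **suffixes, size_t *lcp)
{
    assert(text != NULL && suffixes != NULL && lcp != NULL);
    size_t n = strlen(text);

    for (size_t i = 0; i < n; i++)
        suffixes[i] = text + i;          /* generate every suffix */

    qsort(suffixes, n, sizeof suffixes[0], compare_suffix);

    for (size_t i = 0; i < n; i++)
        lcp[i] = (i == 0) ? 0 : common_prefix_length(suffixes[i - 1], suffixes[i]);
}
```

For the text "banana" the sorted suffixes are a, ana, anana, banana, na,
nana, and the LCP values are 0, 1, 3, 0, 0, 2. Sorting with qsort and
strcmp is chosen for brevity; it is slow on large corpora compared with a
dedicated suffix-array construction.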

--- 

For more practical use, a single C source file combines all the steps and also includes code for accessing the table. It handles Unicode (UTF-8).

dfk.h : Header file
dfk.c : Library

dfk.c provides functions to get the corpus frequency of a string, the document frequency of a string, and the number of documents that contain the string
more than once.
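
To make the three quantities concrete, here is a naive C sketch that
computes them by scanning every document. It is not the interface declared
in dfk.h: the struct and function names are hypothetical, and it assumes
the corpus is given as an array of NUL-terminated strings, one per
document. Occurrences are counted with overlaps, as a suffix-based count
would do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; not taken from dfk.h. */
struct naive_counts {
    size_t cf;   /* corpus frequency: occurrences summed over all documents */
    size_t df;   /* document frequency: documents with at least one occurrence */
    size_t df2;  /* documents with more than one occurrence */
};

/* Count occurrences of pattern in text, overlapping ones included. */
size_t count_occurrences(const char *text, const char *pattern)
{
    size_t n = 0;
    if (pattern[0] == '\0')
        return 0;                        /* treat the empty pattern as absent */
    for (const char *p = strstr(text, pattern); p != NULL;
         p = strstr(p + 1, pattern))
        n++;
    return n;
}

/* Scan each document once and accumulate the three counts. */
struct naive_counts naive_count(const char *const docs[], size_t ndocs,
                                const char *pattern)
{
    struct naive_counts c = {0, 0, 0};
    assert(docs != NULL && pattern != NULL);
    for (size_t i = 0; i < ndocs; i++) {
        size_t tf = count_occurrences(docs[i], pattern);
        c.cf += tf;
        if (tf >= 1)
            c.df++;
        if (tf >= 2)
            c.df2++;
    }
    return c;
}
```

With the documents "abab", "ab", and "cd" and the string "ab", this gives
a corpus frequency of 3, a document frequency of 2, and 1 document that
contains the string more than once. The scan costs time proportional to
the corpus size for every query, which is the cost a precomputed table is
meant to avoid.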


For updates, visit http://www.ss.cs.tut.ac.jp/umemura/cicling2009/