An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming

Title: An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Aoga, John O. R., Guns Tias, and Schaus Pierre
Editor: Frasconi, Paolo, Landwehr Niels, Manco Giuseppe, and Vreeken Jilles
Conference Name: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II
Publisher: Springer International Publishing
Conference Location: Cham
ISBN Number: 978-3-319-46227-1

The main advantage of Constraint Programming (CP) approaches for sequential pattern mining (SPM) is their modularity: new constraints (regular expressions, length restrictions, etc.) can be added. The current best CP approach for SPM uses a global constraint (module) that computes the projected database and enforces the minimum frequency, using a filtering algorithm similar to the PrefixSpan method. However, the resulting system is not as scalable as some of the most advanced mining systems, such as Zaki’s cSPADE. We show how, using techniques from both data mining and CP, one can use a generic constraint solver and yet outperform existing specialized systems. This is mainly due to two improvements in the module that computes the projected frequencies. First, computing the projected database is sped up by pre-computing the positions at which a symbol can become unsupported by a sequence, which avoids scanning the full sequence each time. Second, taking inspiration from the trailing used in CP solvers, we devise a backtracking-aware data structure that allows fast incremental storing and restoring of the projected database. Detailed experiments show that this approach outperforms existing CP as well as specialized systems for SPM, and that the gain in efficiency carries over directly to other settings such as mining with regular expressions. The data and software related to this paper are available at spm/
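
The two ideas named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation and does not use a CP solver. It is a plain depth-first, PrefixSpan-style miner written under the following assumptions: sequences are strings of single-character symbols, the "positions at which a symbol can become unsupported" are taken to be each symbol's last occurrence per sequence, and the backtracking-aware structure is approximated by two shared stacks that are truncated on backtrack. All names (`mine`, `last_pos`, `last_lists`, `sids`, `poss`) are hypothetical.

```python
from typing import Dict, List, Tuple

def mine(db: List[str], min_support: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Return (pattern, support) pairs for all frequent subsequences of db."""
    # Idea 1 (assumed reading): precompute the last position of every symbol
    # in every sequence. A symbol is supported by a suffix starting at `pos`
    # exactly when its last position is >= pos.
    last_pos: List[Dict[str, int]] = [
        {sym: i for i, sym in enumerate(seq)} for seq in db
    ]
    # Symbols sorted by decreasing last position, so support counting can
    # stop at the first symbol that is no longer supported.
    last_lists = [sorted(lp.items(), key=lambda kv: -kv[1]) for lp in last_pos]

    # Idea 2 (assumed reading): one shared pair of stacks holds every
    # projected database on the current search branch. Each node owns the
    # slice [start, start + size); backtracking truncates the stacks.
    sids = list(range(len(db)))  # sequence ids
    poss = [0] * len(db)         # start of the remaining suffix per entry
    results: List[Tuple[Tuple[str, ...], int]] = []

    def search(prefix: Tuple[str, ...], start: int, size: int) -> None:
        # Count supports without scanning the sequences themselves.
        support: Dict[str, int] = {}
        for k in range(start, start + size):
            for sym, last in last_lists[sids[k]]:
                if last < poss[k]:
                    break  # all remaining symbols end even earlier
                support[sym] = support.get(sym, 0) + 1

        for sym in sorted(support):
            if support[sym] < min_support:
                continue
            top = len(sids)  # remember the stack height (the "trail")
            for k in range(start, start + size):
                sid, pos = sids[k], poss[k]
                if last_pos[sid].get(sym, -1) >= pos:
                    # First occurrence of sym in the remaining suffix.
                    sids.append(sid)
                    poss.append(db[sid].index(sym, pos) + 1)
            pattern = prefix + (sym,)
            results.append((pattern, support[sym]))
            search(pattern, top, len(sids) - top)
            # Restore the previous projected database in one step.
            del sids[top:]
            del poss[top:]

    search((), 0, len(db))
    return results
```

For example, `mine(["abc", "acb", "bc"], 2)` reports the patterns `a`, `ab`, `ac`, `b`, `bc` and `c`. In this sketch, restoring state on backtrack costs only a truncation of the two stacks, which is the property the abstract attributes to a trailing-inspired structure.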