*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************
During the last decade, data mining has received great attention
from various fields. This thesis investigates data mining problems
in tree-based models and large-scale contingency tables. The first
half of the thesis pertains to the tree-based models for the
classification problem, which have been very popular in various
fields because of their interpretability and flexibility. Tree
modeling involves two major steps: tree growing and tree pruning.
Tree growing searches over the whole data set to find the
splitting point that leads to the greatest improvement in a
specified score function. Once the trees are grown, tree pruning
pursues the right sized tree that provides the best estimate of
error when the tree is applied to unseen data. In this thesis, we
propose a novel algorithm for tree pruning, called frontier-based
tree pruning (FBP). The new method has an order of computational
complexity comparable to cost-complexity pruning (CCP). Regarding
tree pruning, FBP provides a full spectrum of information: namely,
(1) given the value of the penalization parameter $lambda$, it
gives the decision tree specified by the complexity-penalization
approach; (2) given the size of a decision tree, it provides the
range of the penalization parameter $lambda$, within which the
complexity-penalization approach renders this tree size; and (3)
it finds the tree sizes that are {it inadmissible},
--- so regardless of what the value of the penalty parameter is, the
resulting tree, based on a complexity-penalization framework, will
never have these sizes. Simulations on real data sets reveal
surprising results: in the complexity-penalization approach, most
of the tree sizes are inadmissible. FBP facilitates a more
faithful implementation of cross validation (CV), which is favored
by simulations. As an extension of the FBP algorithm, we study how
CV performs in tree-based models. Considering the abundant results
available on applying CV to regression models, there is little
research on the effects of CV in classification models due to
their nonlinear structure. The main purpose of this study is to
explore the behavior of CV in tree-based models. We report
simulation studies that compare a cross-validated tree classifier
with an oracle classifier that is ideally derived on the knowledge
of underlying distributions. The main observation of this study
indicates that the difference between the testing and training
error from a cross-validated tree classifier and an oracle
classifier empirically has a linear regression relation. The
``slope'' and the ``$R^2$'' of regression models are employed as
the performance measures of a cross-validated tree classifier.
Moreover, simulation reveals that the performance of a
cross-validated tree classifier depends on the geometry, the
parameters of the underlying distributions, and the sample sizes.
Such observations can explain and justify the behavior of CV in
tree-based models.
The second half of the thesis presents multiple testing in
large-scale contingency tables and its application to pattern
recognition of protein structures. One of the most common test
procedures using two-way contingency tables is the test of
independence between two categorizations. Current significant
tests such as $chi^2$ or likelihood ratio tests provide overall
independency but bring limited information about the nature of the
association in the contingency tables. The main purpose of this
study is to develop a follow-up method to $chi^2$ or likelihood
ratio tests that identifies the significantly associated
individual cells in the contingency table. We propose a framework
of multiple testing procedures for testing independence of the
cell categories in contingency tables. In the simulation study, we
compare the power, type I error, and false discovery rate of five
different testing procedures in the contingency table. We observe
that no single procedure is superior for every scenario examined.
In addition, we record the relationships among the proportion of
true null hypotheses, power, type I error, and false discovery
rate. Finally, we employ the proposed method to identify the
patterns of pair-wise associations between amino acids involved in
$beta$-sheet bridges of proteins. We identify a number of amino
acid pairs that exhibit either strong or weak association. These
patterns provide useful information for algorithms that predict
secondary and tertiary structures of proteins.