Research abstract:
|
This paper introduces Hk-medoids, a modified version of the standard k-medoids
algorithm. The modification extends the algorithm for the problem of
clustering complex heterogeneous objects that are described by a diversity of
data types, e.g. text, images, structured data and time series. We previously
proposed an intermediary fusion approach, SMF, which calculates fused
similarities between objects by taking into account the similarities between
their component elements, using appropriate similarity measures. The fusion
entails uncertainty for incomplete objects and for objects whose distances
diverge across the different components. Our
implementation of Hk-medoids proposed here works with the fused distances and
deals with the uncertainty in the fusion process. We experimentally evaluate
the potential of our proposed algorithm using five datasets with different
combinations of data types that define the objects. Our results show the
feasibility of our algorithm and a performance improvement over the original
SMF approach combined with a standard k-medoids that does not take
uncertainty into account. In addition, from a theoretical point of view, our
proposed algorithm has lower computational complexity than the popular PAM
implementation.
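
As a point of reference for the baseline mentioned above (a standard k-medoids run on a fused distance matrix, ignoring uncertainty), here is a minimal sketch. It is not the Hk-medoids algorithm, whose details this abstract does not give, and it uses the simple alternating assign/update scheme rather than PAM's swap search. The function name `k_medoids`, its parameters and the toy data are all illustrative assumptions.

```python
import random


def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed n x n distance matrix.

    Baseline sketch only: it ignores uncertainty and is not Hk-medoids.
    Returns (medoid indices, cluster label per object).
    """
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)

    def assign(current):
        # Each object joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy stand-in for a fused distance matrix: six objects in two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, k=2)
print(sorted(medoids), labels)
```

Because the function only needs pairwise distances, any fused matrix can be passed in directly without access to the original heterogeneous features.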
|
Contribution summary:
|
We define a heterogeneous dataset as a set
of complex objects, that is, those defined by several data types including
structured data, images, free text or time series. We envisage this could be
extensible to other data types. There are currently research gaps in how to
deal with such complex data. In our previous work, we proposed an
intermediary fusion approach called SMF, which produces a pairwise matrix of
distances between heterogeneous objects by fusing the distances between the
individual data types. More precisely, SMF aggregates partial distances that
we compute separately from each data type, taking into consideration
uncertainty. The result is a single fused distance matrix that
a standard clustering algorithm can use to produce a clustering. In
this paper we extend the practical work by evaluating SMF using the k-means
algorithm to cluster heterogeneous data. We used a dataset of prostate cancer
patients where objects are described by two basic data types, namely structured
and time-series data. We assess the results of clustering using external
validation on multiple possible classifications of our patients. The results
show that the SMF approach can improve the clustering configuration when
compared with clustering on an individual data type.
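
The external validation step described above compares a clustering against known class labels. The abstract does not say which indices were used, so the sketch below assumes two common ones, purity and the Rand index; the function names and the toy labels are illustrative only.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of objects that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical example: one known classification versus one clustering result.
classes = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 0, 0, 0, 0]
print(purity(classes, clusters), rand_index(classes, clusters))
```

Running the same functions once per available classification gives one score per labeling, which allows a fused clustering to be compared against clusterings built from a single data type.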
|
Contribution summary:
|
The phenomenon of big data is closely tied
to the complexity of data processing. The amount of work required to conduct
pattern analysis on this kind of data is huge. Valuable patterns and
important knowledge could therefore stay hidden unless new concepts and
methods are developed to apply pattern extraction procedures effectively to
this complexity. There is a lack of experimental work in this area, and this
limitation motivated our research. Our concern is how to apply clustering
analysis to big data by focusing on one aspect: variety of data types. We
introduce the problem by defining
heterogeneous data as data about objects that are described by different data
types, for example, structured data, text, time series, images, etc. Then, we
start the analysis process with objects that are described by only two or
three basic data types, yet make the definition extensible to allow for the
introduction of further data types and complexity in our objects. Generally,
our strategy is based on comparing two fusion approaches: intermediate fusion
and late fusion. In intermediate fusion, the integration takes place
at the level of calculating similarities between heterogeneous objects, after
which a clustering algorithm is applied. In late fusion, the integration
process is conducted on multiple clustering results using ensemble methods.
We begin the research by examining intermediary fusion. We propose an
intermediate fusion technique for calculating distance between heterogeneous
objects that also deals with uncertainty in distance computations. We call it
the Similarity Matrix Fusion (SMF) approach. The main idea of SMF is to
create a comprehensive view of distances for heterogeneous objects. SMF
computes a distance matrix (DM) for each of the elements separately and fuses
these DMs, taking advantage of the complementarity in the data. It also
computes the uncertainty of these calculations, which is used alongside the
fused matrix to indicate how reliable each distance is. Once we have a matrix
representing distances between complex objects and some measure of
uncertainty, we can proceed to cluster heterogeneous objects using standard
algorithms. We provide examples
of our approach using a real dataset of prostate cancer patients who were
diagnosed at the Norfolk and Norwich University Hospital
(NNUH), UK. It was created by Bettencourt-Silva et al. [1] by integrating
data from nine different hospital information systems. Each patient is
represented by a structured-data record and 22 time series that report
different blood test results. In addition, patients are classified into
pre-defined classes. The results include visual representations of both
calculations, distances as well as uncertainty, and the fused distance
matrices were then used to fine-tune clustering algorithms. Since the objects
are already labeled, we take advantage of semi-supervised data analysis to
evaluate our results. We support our conclusions with external clustering
validation methods, which measure the effectiveness of involving all data
types in the analysis rather than relying on a single type of data, and which
also assess our proposed intermediate fusion technique.
In addition, we have created a synthetic dataset of plants drawn
from the website of the Royal Horticultural Society (RHS), the UK’s leading
gardening charity [2]. The purpose of this dataset is to provide a mixture of
data types describing heterogeneous objects that differs in structure and
data types from the cancer dataset. The objects here are described by
structured data, a photo and free text. As with the cancer dataset, we chose
objects from three different plant types in order to have labeled objects.
The next step of the
experiments is to examine this dataset using our proposed intermediate fusion
technique and test other heterogeneous combinations. In the future we plan to
work on the late fusion approach and compare the performance of both
integration techniques.
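
The intermediate fusion step described above (one distance matrix per data type, fused into a single matrix with an accompanying measure of uncertainty) might be sketched as follows. The exact SMF formulas are not given in this abstract, so everything here is an assumption: a weighted mean as the fusion rule, the fraction of missing components as one uncertainty measure, the weighted spread of the component distances as another, and per-type distances already scaled to a common range. The name `fuse_distance_matrices` is hypothetical.

```python
import math


def fuse_distance_matrices(matrices, weights=None):
    """Fuse per-data-type distance matrices (assumed fusion rule, not SMF itself).

    matrices: list of n x n lists, one per data type; None marks a distance
    that could not be computed (e.g. an incomplete object).
    Returns (fused, missing_frac, spread), each an n x n list.
    """
    m = len(matrices)
    n = len(matrices[0])
    weights = weights or [1.0] * m
    fused = [[None] * n for _ in range(n)]
    missing_frac = [[0.0] * n for _ in range(n)]
    spread = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            known = [
                (w, d[i][j]) for w, d in zip(weights, matrices) if d[i][j] is not None
            ]
            # Uncertainty 1: share of data types with no distance for this pair.
            missing_frac[i][j] = 1.0 - len(known) / m
            if not known:
                continue  # no information at all; fused distance stays None
            total_w = sum(w for w, _ in known)
            mean = sum(w * v for w, v in known) / total_w
            fused[i][j] = mean
            # Uncertainty 2: how strongly the data types disagree on this pair.
            spread[i][j] = math.sqrt(
                sum(w * (v - mean) ** 2 for w, v in known) / total_w
            )
    return fused, missing_frac, spread


# Hypothetical example: two data types, two objects, distances scaled to [0, 1].
structured = [[0.0, 0.2], [0.2, 0.0]]
time_series = [[0.0, 0.6], [0.6, 0.0]]
fused, missing_frac, spread = fuse_distance_matrices([structured, time_series])
print(fused[0][1], missing_frac[0][1], spread[0][1])
```

Keeping the two uncertainty matrices separate from the fused distances lets a downstream clustering step decide how much to trust each pairwise distance, rather than baking that judgement into the distance itself.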
|