VenkateshGantiJohannesGehrkeRaghuRamakrishnanDepartmentofComputerSciences,UniversityofWisconsin-Madison
vganti,johannes,raghu@cs.wisc.edu
Abstract
Clusteringisanimportantdataminingproblem.Mostoftheearlierworkonclusteringfocussedonnumericattributeswhichhaveanaturalorderingontheirattributevalues.Recently,clusteringdatawithcategoricalattributes,whoseattributevaluesdonothaveanaturalordering,hasreceivedsomeattention.However,previousalgorithmsdonotgiveaformaldescriptionoftheclusterstheydiscoverandsomeofthemassumethattheuserpost-processestheoutputofthealgorithmtoidentifythefinalclusters.
Inthispaper,weintroduceanovelformalizationofaclusterforcategoricalattributesbygeneralizingadefinitionofaclusterfornumericalattributes.Wethendescribeaveryfastsummarization-basedalgorithmcalledCACTUSthatdiscoversexactlysuchclustersinthedata.CACTUShastwoimportantcharacteristics.First,thealgorithmrequiresonlytwoscansofthedataset,andhenceisveryfastandscalable.OurexperimentsonavarietyofdatasetsshowthatCACTUSoutperformspreviousworkbyafactorofto.Second,CACTUScanfindclustersinsubsetsofallattributesandcanthusperformasubspaceclusteringofthedata.Thisfeatureisimportantifclustersdonotspanallattributes,alikelyscenarioifthenumberofattributesisverylarge.Inathoroughexperimentalevaluation,westudytheperformanceofCACTUSonrealandsyntheticdatasets.
1Introduction
Clusteringisanimportantdataminingproblem.Thegoalofclustering,ingeneral,istodiscoverdenseandsparseregionsinadataset.Mostpreviousworkinclusteringfocussedonnumericaldatawhoseinherentgeometricpropertiescanbeexploitedtonaturallydefinedistancefunctionsbetweenpoints.However,manydatasetsalsoconsistofcategorical
SupportedbyaMicrosoftGraduateFellowship.SupportedbyanIBMGraduateFellowship.
ThisresearchwassupportedbyGrant2053fromtheIBMcorporation.
attributesonwhichdistancefunctionsarenotnaturallydefined.Recently,theproblemofclusteringcategoricaldatastartedreceivinginterest[GKR98,GRS99].
Asanexample,considertheMUSHROOMdatasetinthepopularUCIMachineLearningrepository[CBM98].Eachtupleinthedatasetdescribesasampleofgilledmushroomsusingtwentytwocategoricalattributes.Forinstance,thecapcolorattributecantakevaluesfromthedomainbrown,buff,cinnamon,gray,green,pink,purple,red,white,yellow.Itishardtoreasonthatonecoloris“like”or“unlike”anothercolorinawaysimilartorealnumbers.
Animportantcharacteristicofcategoricaldomainsisthattheytypicallyhaveasmallnumberofattributevalues.Forexample,thelargestdomainforacategoricalattributeofanydatasetintheUCIMachineLearningrepositoryconsistsofattributevalues(foranattributeofthePendigitsdataset).Categoricalattributeswithlargedomainsizestypicallydonotcontaininformationthatmaybeusefulforgroupingtuplesintoclasses.Forinstance,theCustomerIdattributeintheTPC-Ddatabasebenchmark[Cou95]mayconsistofmillionsofvalues;giventhatarecord(orasetofrecords)takesacertainCustomerIdvalue(orasetofvalues),wecannotinferanyinformationthatisusefulforclassifyingtherecords.Therefore,itisdifferentfromtheageorgeographicallocationattributeswhichcanbeusedtogroupcustomersbasedontheirageorlocationorboth.Typically,relationscontaintoattributes;hence,eventhoughthesizeofeachcategoricaldomainissmall,thecrossproductofalltheirdomainsandhencetherelationitselfcanbeverylarge.
Inthispaper,weintroduceafastsummarization-basedalgorithmcalledCACTUSforclusteringcategoricaldata.CACTUSexploitsthesmalldomainsizesofcategoricalat-tributes.ThecentralideainCACTUSisthatsummaryin-formationconstructedfromthedatasetissufficientfordis-coveringwell-definedclusters.Thepropertiesthatthesum-maryinformationtypicallyfitsintomainmemory,andthatitcanbeconstructedefficientlytypicallyinasinglescanofthedatasetresultinsignificantperformanceimprovements:
Attributeswhosedomainistotallyorderedarecallednumeric,whereasattributeswhosedomainisnotorderedarecalledcategorical.
CAtegoricalClusTeringUsingSummaries
afactoroftotimesoveroneofthepreviousalgorithms.Ourmaincontributionsinthispaperare:
1.Weformalizetheconceptofaclusterovercategoricalattributes(Section3).2.Weintroduceafastsummarization-basedalgorithmCAC-TUSforclusteringcategoricaldata(Section4).3.WethenextendCACTUStodiscoverclustersinsub-spaces,especiallyusefulwhenthedataconsistsofalargenumberofattributes(Section5).4.Inanextensiveexperimentalstudy,weevaluateCAC-TUSandcompareitwithearlierworkonsyntheticandrealdatasets(Section6).
2RelatedWork
Inthissection,wediscusspreviousworkonclusteringcate-goricaldata.TheEM(Expectation-Maximization)algorithmisapopulariterativeclusteringtechnique[DLR77,CS96].Startingwithaninitialclusteringmodel(amixturemodel)forthedata,ititerativelyrefinesthemodeltofitthedatabetter.Afteranindeterminatenumberofiterations,itterminatesatalocallyoptimalsolution.TheEMalgorithmassumesthattheentiredatafitsintomainmemoryandhenceisnotscalable.WenowdiscusstwopreviousscalablealgorithmsSTIRRandROCKforclusteringcategoricaldata.
Gibsonetal.introduceSTIRR,aniterativealgorithmbasedonnon-lineardynamicalsystems[GKR98].Theyrepresenteachattributevalueasaweightedvertexinagraph.Multiple
copies
,calledbasins,ofthissetofweightedverticesaremaintained;theweightsonanygivenvertexmaydifferacrossbasins.iscalledtheprincipalbasin;
arecallednon-principalbasins.Startingwitha
setofweightsonallvertices(inallbasins),thesystemis“iterated”untilafixedpointisreached.Gibsonetal.arguethatwhenthefixedpointisreached,theweightsinoneormoreofthebasinsisolatetwogroupsofattributevaluesoneachattribute:thefirstwithlargepositiveweightsandthesecondwithsmallnegativeweights,andthatthesegroupscorrespondintuitivelytoprojectionsofclustersontheattribute.However,theautomaticidentificationofsuchsetsofcloselyrelatedattributevaluesfromtheirweightsrequiresanon-trivialpost-processingstep;suchapost-processingstepwasnotaddressedintheirwork.Moreover,thepost-processingstepwillalsodeterminewhat“clusters”areoutput.Also,asweshowinSection3.2,certainclassesofclustersarenotdiscoveredbySTIRR.
Guhaetal.introduceROCK,anadaptationofanagglom-erativehierarchicalclusteringalgorithm,whichheuristicallyoptimizesacriterionfunctiondefinedintermsofthenumberof“links”betweentuples[GRS99].Informally,thenumberoflinksbetweentwotuplesisthenumberofcommonneighbors
Givenasimilarityfunction,twotuplesinthedatasetaresaidtobeneighborsifthesimilaritybetweenthemisgreaterthanacertainthreshold.
theyhaveinthedataset.Startingwitheachtupleinitsowncluster,theyrepeatedlymergethetwoclosestclusterstilltherequirednumber(say,)ofclustersremain.Sincethecom-plexityofthealgorithmiscubicinthenumberoftuplesinthedataset,theyclusterasamplerandomlydrawnfromthedataset,andthenpartitiontheentiredatasetbasedontheclus-tersfromthesample.Beyondthatthesetofall“clusters”togethermayoptimizeacriterionfunction,thesetoftuplesineachindividualclusterisnotcharacterized.
3Definitions
Inthissection,weformallydefinetheconceptofaclusterovercategoricalattributes,andotherconceptsusedintheremainderofthepaper.WethencomparetheclassofclustersallowedbyourdefinitionwiththosediscoveredbySTIRR.3.1ClusterDefinition
Intuitively,aclusteronasetofnumericattributesidentifiesa“denseregion”intheattributespace.Thatis,theregionconsistsofasignificantlylargernumberoftuplesthanexpected.Wegeneralizethisintuitivenotionfortheclassofhyper-rectangularclusterstothecategoricaldomain.Asanillustrativeexample,theregion
maycorrespondtoaclusterinthe3-dspacespannedbythreenumericattributes.Ingeneral,theclassofrectangularregionscanbeexpressedasthecrossproductofintervals.Sincedomainsofcategoricalattributesarenotordered,theconceptofanintervaldoesnotexist.However,astraightforwardgeneralizationoftheconceptofanintervaltothecategoricaldomainisasetofattributevalues.Consequently,thegeneralizationofrectangularregionsinthenumericdomaintocategoricaldomainisthecrossproductofsetsofattributevalues.Wecallsuchregionsintervalregions.
Intuitively,aclusterconsistsofasignificantlylargernumberoftuplesthanthenumberexpectedifallattributeswereindependent.Inaddition,aclusteralsoextendstoaslargearegionaspossible.Wenowformalizethisnotionforcategoricaldomainsbyfirstdefiningthenotionofatuplebelongingtoaregion,andthenthesupportofaregion,whichisthenumberoftuplesinthedatasetthatbelongtotheregion.
Definition3.1Letbeasetofcategoricalat-tributeswithdomains,respectively.Letthedatasetbeasetoftupleswhereeachtuple:
.Wecallanintervalregioniffor
all,.Letand,.Thesupportoftheattributevaluepairwithrespecttoisdefinedasfollows:
and
Classesofclustersthatcorrespondtoarbitrarilyshapedregionsinthenumericdomaincannotbegeneralizedascleanlytothecategoricaldomainbecausethecategoricalattributesdonothaveanaturalorderingimposedontheirdomains.Therefore,weonlyconsidertheclassofhyper-rectangularregions.
Atupleissaidtobelongtothe
regionifforall
,.Thesupportofisthenumberoftuplesinthatbelongto.
Ifallattributesareindependentandtheat-tributevaluesineachattributeareequallylikely(henceforthreferredtoastheattribute-independenceassumption)thentheexpectedsupportofaregionis
.Asbefore,theexpectedsupportofis.Sincethedatasetisunder-stoodfromthecontext,wewriteinsteadofinsteadof.Finally,wenotethatthe,andat-tributeindependenceassumptioncanbemodifiedtotakeany
priorinformationintoaccount;e.g.,themarginalprobabilitiesofattributevalues.Intuitively,capturestheco-occurence,andhencethesimilarity,ofattributevaluesand.Valuesandaresaidtobestronglyconnectediftheirco-occurrence()issignificantlyhigher(bysomefactor)thanthevalueexpectedundertheattribute-independence.Wenowdefinetoformalizethisintuition,andthengiveaformaldefinitionofacluster.
Definition3.2Let,,and.Theattributevaluesandarestronglyconnectedwithrespect
toif
.Thefunctionis
definedasfollows:
ifandarestronglyconnectedotherwise
Letand,,betwosetsofattributevalues.Anelementisstronglyconnectedwithif,forall,andarestronglyconnected.andaresaidtobestronglyconnectedifeachisstronglyconnectedwithandeachisstronglyconnectedwith.
Definition3.3For,let,,
and.Thenifthefollowingthreeconditionsisaareclustersatisfied.
over
(1)Forall,andarestronglyconnected.(2)Forall,,thereexistsnosuchthatforall,andarestronglyconnected.(3)Thesupportofisatleasttimestheexpectedsupportofundertheattribute-independenceassumption.Wecallthecluster-projectionofon.iscalledasub-clusterifitsatisfiesconditions(1)and(3).Aclusteroverasubsetofallattributesiscalledasubspaceclusteron;iftheniscalleda-cluster.
Becauseadeviationofortimestheexpectedvalueisusuallyconsideredsignificant[BD76],typicalvaluesofarebetweenand.
Wenowextendournotionofsimilaritytoattributevalue
pairsonthesameattribute.Let
and.Ifandarestronglyconnectedthenare“similar”toeachotherwithrespectto.Thelevelofsimilarityisthenumberofsuchdistinctattributevalues
.Wenowformalizethisintuition.
Definition3.4Let.Thesimilaritybetweenandwithrespecttoisdefinedasfollows.
and
Below,wedefinethesummaryinformationwhichweneedlatertodescribetheCACTUSalgorithm.Thesummaryin-formationisoftwotypes:(1)inter-attributesummariesand(2)intra-attributesummaries.Theinter-attributesummariesconsistofallstronglyconnectedattributevaluepairswhereeachpairhasattributevaluesfromdifferentattributes;theintra-attributesummariesconsistofsimilaritiesbetweenat-tributevaluesofthesameattribute.
Definition3.5Letbeasetofcategoricalat-tributeswithdomainsrespectively,andletbeadataset.Theinter-attributesummaryisdefinedas:
where
and
Theintra-attributesummary
isdefinedas:andwhere
and
3.2Discussion
WenowcomparetheclassofclustersallowedbyourdefinitionwiththeclustersdiscoveredbySTIRR.Forthecomparison,wegeneratetestdatausingthedatageneratordevelopedbyGibsonetal.forevaluatingSTIRR[GKR98].WeconsiderthreedatasetsshowninFigures1,2,and3.Eachdatasetconsistsoftuples.DS1andDS2havetwoattributes,DS3hasthreeattributeswhereeachattributeconsistsofattributevalues.Thesetuplesaredistributedoverallattributevaluesoneachattributeaccordingtotheattribute-independenceassumption.Wecontrolthelocationandthesizeofclustersineachdatasetbydistributinganadditionalnumberoftuples(ofthetotalnumberinthedataset)inregionsdesignatedtobeclustersthusincreasingtheirsupportsabovetheexpectedvalueundertheattribute-independenceassumption.InFigures1,2,and3,thecluster-projectionofeachclusterisshownwithinanellipse.Theboundariesofthecluster-projections(ellipses)ofaclusterareconnectedbylinesofthesametype(e.g.,solid,dashedetc.).
0000000910109109109107910179101920991920991819192019209919209920999999Figure1:DS1Figure2:DS2
0Figure3:DS3
0000009109109109107910910192099192099171819209999192099192099192099Figure4:DS1:STIRR’sO/PFigure5:DS2:STIRR’sO/PFigure6:DS3:STIRR’sO/P
WeranSTIRRonthedatasetsshowninFigures1,2,and3,andmanuallyselectedthebasinthatassignspositiveandnegativeweightsrespectivelytoattributevaluesindifferentcluster-projections.Toidentifytheclusterprojections,weobservedtheweightsallocatedbySTIRRandisolatedtwogroupssuchthattheweightsineachgrouphavelargemagnitudeandareclosetoeachother.Thecluster-projectionsidentifiedbySTIRRareshowninFigures4,5,and6.
STIRRrecognizedthecluster-projectionsforDS1onthefirstnon-principalbasin()foreveryattribute(asshowninFigure4).WhenrunonthedatasetDS2,thefirstnon-principalbasin()onidentifiesthetwogroups:
and(asshowninFigure5).The
secondnon-principalbasin()onidentifiesthefollowingtwogroups:and.Thus,nobasinidentifiestheoverlapbetweenthecluster-projections.Itmaybepossibletoidentifysuchoverlapsthroughanon-trivialpostprocessingstep.However,itisnotclearhowmanybasinsarerequiredandhowtorecognizethatcluster-projectionsoverlapfromtheweightsonattributevalues.Webelievethatanysuchpost-processingstepitselfwillbesimilartotheCACTUSalgorithm.TheresultofrunningSTIRRonthedatasetDS3isshowninFigure6.STIRRmergedthetwocluster-projectionsonthesecondattribute,possiblybecauseoneofthecluster-projectionsparticipatesinmorethanonecluster.
Fromtheseexperiments,weobservethatSTIRRfailstodiscoverthefollowingclassesofclusters:(1)clustersconsistingofoverlappingcluster-projectionsonanyattribute,(2)clusterswheretwoormoreclusterssharethesamecluster-projection.However,intuitively,thesetwoclassesofclusters
a1a2aa34b1b2b3b4c1c2cc34ABCw.r.t.
:2:2:2:2:2:2
w.r.t.
:2:2:2
w.r.t.
:3:2
Figure7:
Figure8:
shouldbevalidclassesofclusters,andourclusterdefinitionincludestheseclasses.CACTUScorrectlydiscoversalltheimplantedclustersfromthedatasetsDS1,DS2,andDS3.Thus,ourdefinitionofaclusterandhenceCACTUS,whichdiscoversallclustersallowedbyourdefinition,seemstoidentifyabroaderclassofclustersthanthatdiscoveredbySTIRR.SinceitisnotpossibletocharacterizeclustersdiscoveredbySTIRR,wecouldnotconstructanyexampledatasetsfromwhichCACTUSdoesnotretrievetheexpectedclustersandSTIRRdoes.However,itispossiblethatsuchtypesofclustersexist.
4CACTUS
Inthissection,wedescribeourthree-phaseclusteringalgo-rithmCACTUS.ThecentralideabehindCACTUSisthatasummaryoftheentiredatasetissufficienttocomputeaset
of“candidate”clusterswhichcanthenbevalidatedtodeter-minetheactualsetofclusters.CACTUSconsistsofthreephases:summarization,clustering,andvalidation.Inthesummarizationphase,wecomputethesummaryinformationfromthedataset.Intheclusteringphase,weusethesum-maryinformationtodiscoverasetofcandidateclusters.Inthevalidationphase,wedeterminetheactualsetofclus-tersfromthesetofcandidateclusters.Weintroduceahy-potheticalexamplewhichweusethroughoutthepapertoillustratethesuccessivephasesinthealgorithm.Consideradatasetwiththreeattributes,andwithdomains
,,and,respec-tively.Letthestronglyconnectedattributevaluepairsbeas
showninFigures7and8.
4.1SummarizationPhase
Inthissection,wedescribethesummarizationphaseofCACTUS.Weshowhowtoefficientlycomputetheinter-attributeandtheintra-attributesummaries,andthendescribetheresourcerequirementsformaintainingthesesummaries.Categoricalattributesusuallyhavesmalldomains.Typicalcategoricalattributedomainsconsideredforclusteringconsistoflessthanahundredor,rarely,athousandattributevalues.Animportantimplicationofthecompactnessofcategoricaldomainsisthattheinter-attributesummaryforanypairofattributesandfitsintomainmemorybecausethenumberofallpossibleattributevaluepairsfromandequals.Fortherestofthissection,weassumethattheinter-attributesummaryofanypairofattributesfitseasilyintomainmemory.(Wewillgiveanexamplelatertosupportthisassumption,andtoshowthattypicallyinter-attributesummariesformanypairsofattributestogetherfitintomainmemory.)However,forthesakeofcompleteness,weextendourtechniquesinSection5tohandlecaseswherethistraitisviolated.Thesameargumentholdsfortheintra-attributesummariesaswell.
4.1.1Inter-attributeSummaries
Wenowdiscussthecomputationoftheinter-attributesum-maries.Considerthecomputationof,.Weinitializeacountertozeroforeachpairofattributevalues
,andstartscanningthedataset.Foreachtuple,weincrementthecounterforthepair.Afterthescanofiscompleted,wecomputebyset-tingtozeroallcounterswhosevalueislessthanthethreshold
.Thus,countsofonlythestronglycon-nectedpairsareretained.Thenumberofstronglyconnectedpairsisusuallymuchsmallerthan.Therefore,thesetofstronglyconnectedpairscanbemaintainedinspecial-izeddatastructuresdesignedforsparsematrices[DER86].Wenowpresentahypotheticalexampletoillustratetheresourcerequirementsofthesimplestrategydescribedabove.Consideradatasetwithattributeseachconsistingof
Inourcurrentimplementation,wemaintainthecountsofstrongly
connectedpairsinanarrayanddonotoptimizeforspace.
attributevalues.SupposewehaveMBofmainmemory(easilyavailableoncurrentdesktopsystems).Assumingthateachcounterrequiresbyteswecanmaintaincountersfor
attributepairssimultaneously.With
attributes,wehavetoevaluateattributepairs.Therefore,wecancomputeallinter-attributesummariestogetherinjustonescanofthedataset.Thecomputationalandspacerequirementsherearesimilartothatofobtainingcountsofpairsofitemswhilecomputingfrequentitemsets[AMS96].Quiteoften,asinglescanissufficientforcomputing.Insomecases,wemayneedtoscanmultipletimeseachscancomputingforadifferentsetofpairs.Thecomputationoftheinter-attributesummariesisCPU-intensive,especiallywhenthenumberofattributesishigh,becauseforeachtupleinthedataset,wehavetoincrement
counters.Evenifwerequiremultiplescansofthe
dataset,theI/OtimeforscanningthedatasetgoesupbutthetotalCPUtimeforincrementingthecountersremainsthesame.SincetheCPUtimedominatestheoverallsummary-constructiontime,therelativeincreaseduetomultiplescansisnotsignificant.Forinstance,consideradatasetofmilliontuplesdefinedonattributes,eachconsistingofattributevalues.Experimentally,wefoundthatthetotaltimeforcomputingtheinter-attributesummariesofthedatasetwithmilliontuplesisseconds,whereasascanofthedatasettakesjustseconds.Supposewepartitionallthepairsofattributesintothreegroupsconsistingof,,and
pairsrespectively.Thecomputationoftheinter-attributesummariesofattributepairsineachgrouprequiresascanofthedataset.Thetotalcomputationtimewillbearound
seconds,whichisonlyslightlyhigherthancomputingthesummaryinonescan.
4.1.2Intra-attributeSummaries
Inthissection,wedescribethecomputationoftheintra-attributesummaries.Weagainexploitthecharacteristicthatcategoricaldomainsareverysmallandthusassumethattheintra-attributesummaryofanyattributefitsinmainmemory.OurprocedureforcomputingreflectstheevaluationofthefollowingSQLquery:
SelectFromasT1(A,B),asT2(A,B)WhereT1.AT2.AandT1.B=T2.BGroupbyT1.A,T2.AHaving;
Theabovequeryjoinswithitselftocomputethesetofattributevaluepairsofstronglyconnectedtoeachotherwithrespectto.Sincefitsinmainmemorytheself-joinandhencethecomputationofisveryfast.Wewillobserveinthenextsectionthat,atanystageofouralgorithm,weonlyrequireforaparticularpairofattributesand
Foranexpositionofjoinprocessing,seeanystandardtextbookondatabasesystems,e.g.,[Ram97].
.Therefore,wecompute,,foreachpairwheneveritisrequired.
ConsidertheexampleshowninFigure7.(Weusethe
notation
todenotetheinter-attributesummarybetweenattributesand.)Theinter-attributesummaries,
,andcorrespondtotheedgesbetweenattributevaluesinthefigure.Theintra-attributesummaries,
,areshowninFigure8.
4.2ClusteringPhase
Inthissection,wedescribethetwo-stepclusteringphaseofCACTUSthatusestheattributesummariestocomputecandidateclustersinthedata.Inthefirststep,weanalyzeeachattributetocomputeallcluster-projectionsonit.Inthesecondstep,wesynthesize,inalevel-wisemanner,candidateclustersonsetsofattributesfromthecluster-projectionsonindividualattributes.Thatis,wedeterminecandidateclustersonapairofattributes,thenextendthepairtoasetofthreeattributes,andsoon.Wenowdescribeeachstepindetail.4.2.1ComputingCluster-ProjectionsonAttributesLetbethesetofattributesandbetheirdomains.Thecentralideaforcomputingallcluster-projectionsonanattributeisthataclusteroverthesetofattributesinducesasub-clusteroveranyattributepair,.Inaddition,thecluster-projectiononoftheclusteristheintersectionofthecluster-projectionsonof-clustersoverattributepairsonthe,(theattribute.Forexample,cluster-projectioninFigurethecluster-projection
on7istheoftheintersection
of-cluster
ofthe-cluster)and(thecluster-projection).Weformalizeon
theideainthefollowinglemma.
Lemma4.1Letbeaclusteronthesetof
attributes.Then,(1)Forall,
,
isasub-cluster
overthepairofattributes.
(2)Thereexistsasetand
isa2-cluster
oversuchthat
.
Lemma4.1motivatesthefollowingtwo-stepapproach.Inthefirstpairwisecluster-projectionstep,weclustereachat-tributewithrespecttoeveryotherattribute,tofindallcluster-projectionsonof-clustersover.Inthesecondintersectionstep,wecomputeallthecluster-projectionsonofclustersoverbyinter-sectingsetsofcluster-projectionsfrom-clusterscomputedinthefirststep.However,theproblemofcomputingcluster-projectionsof-clustersinthepairwisecluster-projectionstepisatleastashardastheNP-completecliqueproblem[GJ79].
Acliqueinisasetofverticesthatareconnectedtoeachotherbyedgeswithnon-zeroweights.Givenagraphandaconstant,thecliqueproblemdeterminesifconsistsofacliqueofsizeatleast.
Bb4b3b2b1a1a2a3a4AFigure9:Extending
w.r.t.
Thefollowinglemmaformalizesthecomputationalcomplex-ity.Theproofisgiveninthefullpaper[GGR99].
Lemma4.2Letandbetwoattributes.Theproblemofcomputingallcluster-projectionsonof-clustersover
isNP-complete.
Toreducethecomputationalcomplexityofthecluster-projectionproblem,weexploitthefollowingpropertywhich,webelieve,isusuallyexhibitedbyclustersinthecategorical
domain.Ifacluster-projection
onofone(ormore)cluster(s)islargerthanafixedpositiveinteger,calledthedistinguishingnumber(denoted),thenitconsistsofasmallidentifyingsetwhichwecallthedistinguishingsetofattributevaluessuchthattheywillnottogetherbecontainedinanyothercluster-projectionon.Thus,thedistinguishingsetdistinguishesfromothercluster-projectionson.Notethatapropersubsetofthedistinguishingsetmaystillbelongtoanothercluster-projection,andthattwodistinctclustersmayshareanidenticalcluster-projection(asinFigure1).
Webelievethatthedistinguishingsubsetassumptionholdsinalmostallcases.Evenforthemostrestrictiveversion,whichoccurswhenthedistinguishingnumberisandallcluster-projectionsofthesetofclustersaredistinct,theas-sumptiononlyrequiresthateachclusterconsistofasetofattributevaluesoneoneachattributethatdoesnotbelongtoanyothercluster.FortheexampleinFigure7,thesets
oridentifythecluster-projectionontheattribute.Wenowformallystatetheassumption.
DistinguishingSubsetAssumption:Letandeachofsizegreaterthanbetwodistinctcluster-projectionsontheattribute.Thenthereexisttwosetsandsuchthat
and
Wecall
thedistinguishingnumber.
PairwiseCluster-ProjectionsWecomputecluster-projectionsonof-clustersovertheattributepairintwosteps.Inthefirststep,wefindallpossibledistinguishingsets(ofsizelessthanorequalto)on.Inthesecondstep,weextendwithrespecttosome
ofthesedistinguishingsetstocomputecluster-projectionson.Henceforth,wewrite“cluster-projectionon”insteadofa“cluster-projectiononofa-clusterover.”DistinguishingSetComputation:Inthefirststep,werelyonthefollowingtwopropertiestofindallpossibledistinguishingsetson.(1)Allpairsofattributevaluesinadistinguishingsetarestronglyconnected;thatisthedistinguishingsetformsaclique.(2)Anysubsetofadistinguishingsetisalsoaclique(monotonicityproperty).Thesetwopropertiesallowalevel-wisecliquegenerationsimilartothecandidategenerationinapriori[AMS96].Thatis,wefirstcomputeallcliquesofsize,thenusethemtocomputecliquesofsize,andsoonuntilwecomputeallcliquesofsizelessthanorequalto.Letdenotethesetofallcliquesofsizeequalto.Wegiveaninductivedescriptionoftheproceduretogeneratethe
set
.Thebasecasewhenconsistsofallpairsofstronglyconnectedattributevaluesin.Thesepairscaneasilybefoundfrom.Thesetiscomputedfromtheset()by“joining”withitself.Thejoinisthesub-setjoinusedinthecandidategenerationstepofthefrequentitemsetcomputationintheapriorialgorithm[AMS96].Wealsoremoveallthecandidatesinthatcontainaproper-subsetnotin(alasubsetpruninginapriori).
ExtensionOperation:Inthesecondstep,we“extend”someofthecandidatedistinguishingsetscomputedinthefirststeptocomputecluster-projectionsonof-clusterson
.Theintuitionbehindtheextensionoperationis
illustratedinFigure9.Supposewewanttoextendwithrespectto.Wecomputethesetofattributeonvaluesonstronglyconnectedwithwiththesetofallothervalues.Wethenextend
onthatis
stronglyconnectedwith.
Informally,theextensionofadistinguishingsetaddstoallattributevaluesinthatarestronglyconnected
withthesetofallattributevaluesin
thatisstronglyconnectedwith.Wenowintroducetheconceptsofsiblingset,subsetflag,andtheparticipationcounttoformallydescribetheextensionoperation.
Definition4.1Letandbetwoattributeswithdomains
and.Letbethesetofcluster-projectionsonof-clustersover.Letbeasetofcandidatedistinguishingsets,withrespectto,onattribute.Thesiblingsetofwithrespecttotheattributeisdefinedasfollows:
forall
iscalledthesiblingstrengthofwithrespectto.
Thesubsetflagofwithrespecttoacollectionofsetsissaidtobeset(to)ifthereexistsasetsuchthat.Otherwise,thesubsetflagofisnotset.Theparticipationcountofwithrespecttoisthesumofthesiblingstrengthswithrespecttoofallsupersetsofin.
Informally,thesubsetflagandtheparticipationcountserve
thefollowingtwopurposes.First,acluster-projectionmay
consistofmorethanonedistinguishingsetin
.Therefore,ifweextendeachsetinaparticularcluster-projectionmaybegeneratedseveraltimes,onceforeachdistinguishingsetitcontains.Toavoidtherepeatedgenerationofthesamecluster-projection,weassociatewitheachdistinguishingsetasubsetflag.Thesubsetflagindicateswhetherthedistinguishingsetisasubsetofanexistingcluster-projectionin.Therefore,ifthesubsetflagissetthenthedistinguishingsetneednotbeextended.FortheexampleshowninFigure7,thedistinguishingsetscanbothbeextendedtothecluster-projectionandon.Second,thedistinguishingsubsetassumptionappliesonlytocluster-projectionsofsizegreaterthan.Therefore,acliqueofsizelessthanorequaltomaybeacluster-projectiononitsowneventhoughitmaybeasubsetofsomeothercluster-projection.Torecognizesuchsmallcluster-projections,weassociateaparticipationcountwitheachdistinguishingset.Iftheparticipationcountofadistinguishingsetwithrespecttoislessthanitssiblingstrengththenitmaybeasmallcluster-projection.
Algorithm4.1Extend()
/*Output:*//*Initialization*/
Resetthesubsetflagsandtheparticipationcountsofalldistinguishingsetsintozeroforeach
ifthesubsetflagofisnotsetthenExtendto
Setthesubsetflagsandincrementbythesiblingstrengthoftheparticipationcountsofallsubsetsofin.end/*if*/end/*for*/
Identifyandaddsmallcluster-projections(ofsize)to
ThepseudocodeforthecomputationofisshowninAl-gorithm4.1.Below,wedescribeeachstepindetail.
Initialization:Thefirsttwostepsinitializetheprocedure:weset,andthesubsetflagsandtheirparticipation
countsofalldistinguishingsetsin
tozero.Extending:Letbethesiblingsetofwithrespectto.Letbethesiblingsetofwithrespectto.Then,weextendtothecluster-projection.Addto.Prunesubsetsof:Supposewasextendedfrom.Then,bydefinition,subsetsofcannotbethedistinguish-ingsetsofotherclusterprojectionson.Therefore,weset(to)allsubsetflagsofsubsetsof(including)in.Theparticipationcountofeachofthesesubsetsisalsoin-creasedbythesiblingstrengthof.
Identifyingsmallcluster-projections:Whileextendingdis-tinguishingsets,weonlychoosesetswhosesubsetflagsarenotset.Wecheckifeachunextendeddistinguishingsetwhosesubsetflagissetcanbeasmall(ofsizelessthan)cluster-projection.Iftheparticipationcountofequalsitssiblingstrength,thencannotbeacluster-projectiononitsown.Otherwise,maybeacluster-projection.Therefore,weaddto.
Notethatthecomputationofcluster-projectionsonrequiresonlytheinter-attributesummaryandtheintra-attributesummary.Sinceandfitintomainmemory,thecomputationisveryfast.
IntersectionofCluster-projections
Informally,theintersectionstepcomputesthesetofcluster-projectionsonofclustersoverbysucces-sivelyjoiningsetsofcluster-projectionsonof-clusters
overattributepairs
,.Fordescribingtheproce-dure,werequirethefollowingdefinition.
Definition4.2Letandbetwocollectionsofsetsofattributevalueson.Wedefinetheintersectionjoinbetweenandasfollows:
and
suchthat
and
Letbethesetofcluster-projectionsonwithrespectto,.Letif,else.Startingwith,theintersectionstepexecutesthefollowingoperationforall.
if
Theresultingsetisthesetofcluster-projectionsonofclustersover.Besidesbeingamain-memoryoperation,thenumberofcluster-projectionsonwithre-specttoanyotherattributeisusuallysmall;therefore,theintersectionstepisquitefast.
Furtheroptimizationsarepossibleoverthebasicstrategydescribedaboveforcomputingclusterprojections.Forinstance,wecancombinethecomputationofandthatof
because,foreachcluster-projectionin,wecomputeitssiblingsetwhichisacluster-projectionin.However,wedonotconsidersuchoptimizationsbecausetheclusteringphasetakesasmallfraction(lessthan)ofthetimetakenbythesummarizationphase.(OurexperimentsinSection6confirmthisobservation.)
4.2.2Level-wiseSynthesisofClusters
Inthissection,wedescribethesynthesisofcandidateclus-tersfromthecluster-projectionsonindividualattributes(com-putedasdescribedinSection4.2).Thecentralideaisthata
clusteronasetofattributesinducesasub-clusteronanysub-setoftheattributes(monotonicityproperty).Themonotonic-itypropertyfollowsdirectlyfromthedefinitionofacluster.
Wealsoexploitthefactthatwewanttocomputeclustersover
thesetofallattributes
.Informally,westartwithcluster-projectionsonandthenextendthemtoclus-tersover,thentoclustersover,andsoon.
Letbethesetofcluster-projectionsontheattribute,
.Letdenotethesetofcandidateclusters
definedoverthesetofattributes.Therefore,
.Wesuccessivelygeneratefromuntilisgeneratedorisemptyforsome.Thegenerationoffromproceedsasfollows.Set
.Foreachelement,we
attempttoaugmentwithaclusterprojectionontheattribute.Ifforall,isaweaugmentbecheckedbylookingsub-clusteronwhichcanup
toandaddto.
FortheexampleinFigure7,thecomputationofthesetofcandidateclustersproceedsasfollows.Westartwiththeseton.overWethethenattributefindthepaircandidate,and-cluster
then
thecandidate-cluster.
over
4.3Validation
Wenowdescribeaproceduretocomputethesetofactualclustersfromthesetofcandidateclusters.Someofthecandidateclustersmaynothaveenoughsupportbecausesomeofthe-clustersthatcombinetoformacandidateclustermaybeduetodifferentsetsoftuples.Torecognizesuchfalsecandidates,wecheckifthesupportofeachcandidateclusterisgreaterthantherequiredthreshold.Onlyclusterswhosesupportonpassesthethresholdrequirementareretained.Aftersettingthesupportsofallcandidateclusterstozero,westartscanningthedataset.Foreachtuple,weincrementthesupportofthecandidateclustertowhichbelongs.(Becausethesetofclusterscorrespondtodisjointintervalregions,canbelongtoatmostonecluster.)Attheendofthescan,wedeleteallcandidateclusterswhosesupportinthedatasetislessthantherequiredthreshold:timestheexpectedsupportoftheclusterundertheattribute-independenceassumption.
Byconstruction,CACTUSdiscoversallclustersthatsatisfyourclusterdefinition,andhencethefollowingtheoremholds.
Theorem4.1Giventhatthedistinguishingsubsetassump-tionholds,CACTUSfindsallandonlythoseclustersthatsat-isfyDefinition3.3.
5Extensions
Inthissection,weextendCACTUStohandleunusuallylargeattributevaluedomainsaswellastoidentifyclustersinsubspaces.
5.1LargeAttributeValueDomains
Untilnow,weassumedthatthedomainsofcategoricalattributesaresuchthattheinter-attributesummaryofanypairofattributesandtheintra-attributesummaryofanyattributefitsinmainmemory.Forthesakeofcompleteness,wemodifythesummarizationphaseofCACTUStohandlearbitrarilylargedomainsizes.
Recallthatthesummaryinformationonlyconsistsofstronglyconnectedpairsofattributevalues.Forlargedomainsizes,thenumberofstronglyconnectedattributevaluepairs(eitherfromthesameorfromdifferentattributes)relativetothenumberofallpossibleattributevaluepairsisverysmall.Weexploitthisobservationtocollapsesetsofattributevaluesoneachattributeintoasingleattributevaluethuscreatinganewdomainofsmallersize.Theintuitionisthatifapairofattributevaluesintheoriginaldomainarestronglyconnected,thenthecorrespondingpairoftransformedattributevaluesarealsostronglyconnected,providedthethresholdforstrongconnectivitybetweenattributevaluesinthetransformeddomainisthesameasthatfortheoriginaldomain.
Let
beanattributewithanunusuallylargedomain.Withoutlossofgenerality,letbethesetbethemaximumnumberofattributevalues.Let
per
attributesothattheinter-attributesummariesandtheintra-attributesummariesfitintomainmemory.Let.Weconstructofsizefrombymappingforagiven
tothevalue,theset.Formally,
ofattributevalues
where
Wesetthethresholdforthestrongconnectivityinvolvingattributevaluesinasifwasbeingused.Wethencomputetheinter-attributesummariesinvolvingusingthetransformeddomain.Foreachattributevaluethatparticipatesinastronglyconnectedpair(,),weexpandtothesetofallattributevalues
thatmapintoandform
thepairs.Wethenscanthedatasettocountthesupportsofallthesepairs,andselectthestronglyconnectedpairsamongthem;theyconstitutetheinter-attributesummary.
Thenumberofnewpairswhosesupportsaretobecountedislessthanorequaltowhererepresentsthenumberofstronglyconnectedpairsin.Ifthissetofpairsisstilllargerthanmainmemory,wecanrepeattheabovetransformationtrick.However,webelievethatsuchrepeatedapplicationwillberare.
5.2ClustersinSubspaces
CACTUSdoesnotdiscoverclustersinsubspacesforthefollowingreason.Theorderinwhichcluster-projectionsonindividualattributesarecombinedmaynotbetherightordertofindasubspacecluster.Forinstance,ifspansthesubspacedefinedbyasetofattributes
(when)thenthelevel-wisesynthesisdescribedinSection4.2.2willnotfind.
Theextensiontofindsubspaceclustersexploitsthemono-tonicitypropertyofsubspaceclusters.Thatis,aclusterinasubspaceinducesasubclusteronanysubsetof.Themonotonicitypropertyagainmotivatestheapriori-stylelevel-wisesynthesisofcandidateclustersfromthecluster-projectionsonindividualattributes.Thealgorithmdiffersintwowaysfromthealgorithmtofindclustersoverallattributes.Thefirstdifferenceisthatwedonotrestrictthatacluster-projectiononanattributeshouldparticipatein-clusterwitheveryotherattribute.Theseconddifferenceisintheproce-dureforgeneratingthesetofcandidateclusters.Wenowdis-cussbothdifferences.
Weskiptheintersectionofcluster-projectionsoneach
attribute
withrespecttoeveryotherattribute()forthefollowingtworeasons.First,aclusterinsubspacemaynotinducea-clusteronapairofattributesnotin,andhencetheintersectionofcluster-projectionsonanattributeinwithrespecttoeveryotherattributemayreturnanemptyset.Second,theintersectionmaycausethelossofmaximality(condition(2)inDefinition3.3)ofasubspacecluster.Forinstance,acluster-projectiononwithrespecttocorrespondstoa-clusterover
which,bydefinition,isasubspacecluster;truncatingsuchacluster-projectionintheintersectionstepwillnolongeryieldamaximalclusteron.
Inthecandidategenerationalgorithm,weletdenotethesetofcandidateclustersdefinedonanysetof-attributes(notnecessarily).Otherwise,thecandidategenerationproceedsexactlyasinSection4.2.2.Thereasonisthatasubspaceclusteronattributesmaynotalwaysbeinthefirstattributes.Foraclusterinasubspaceconsistingofattributes,theabovecandidategenerationprocedureexamines
),candidates.thenumberDependingofcandidateontheclustersvalueofcan(say,beprohibitivelylargerthanhigh.TheproblemofexaminingalargenumberofcandidateclustershasbeenaddressedbyAgrawaletal.[AGGR98].Theyusetheminimumdescriptionlengthprincipletoprunethenumberofcandidateclusters.Theirtechniquesapplydirectlyinourscenarioaswell.Therefore,wedonotaddressthisproblem;instead,wereferthereadertotheoriginalpaper[AGGR98].
6PerformanceEvaluation
Inthissection,weshowtheresultsofadetailedevaluationofthespeedandscalabilityofCACTUSonsyntheticandrealdatasets.WealsocomparedtheperformanceofCAC-TUSwiththeperformanceofSTIRR.OurresultsshowthatCACTUSisveryfastandscalable;itoutperformsSTIRRbyafactorbetweenand.
WeintendtocompareCACTUSandROCKafterourongoingimplemen-tationofROCKiscomplete.
Time vs. #tuples2000CACTUSSTIRR500040003000200010000Time vs. #AttributesCACTUSSTIRR250Time vs. #Attribute ValuesCACTUSSTIRR200Time (in seconds)Time (in seconds)1500Time (in seconds)15010001005005001234#tuples (in Millions)50102030#Attributes405000200400600#Attribute values8001000Figure10:Timevs.#TuplesFigure11:Timevs.#AttributesFigure12:Timevs.#Attr-values
FirstAuthor
Katz,Stonebraker,WongDeWitt,Hsiao
DeWitt,GhandeharizadehKanellakis,Beeri,VardiRamakrishnan,BeeriBancilhon,KiferAfrati,Cosmadakis
Alonso,Barbara,GarciaMolinaDevor,Elmasri
Barsolou,Keller,WiederholdBarsalou,Keller,Shalom
FirstAuthor(contd.)Ceri,Navathe
Abiteboul,GrumbachKorth,LevyAgrawal,GehaniChen,Hua,SuChen,Hua,Lam
Collmeyer,King,ShemerCopeland,Lipovski,SuCornell,Dan,Iyer,YuChang,Gupta
Fischer,Griffeth,LynchSecondAuthorKatz,WongDeWitt,David
DeWitt,GhandeharizadehAbiteboul,BeeriBeeri,SrivastavaRamakrishnan,Kim
Papadimitriou,CosmadakisGarciaMolina,BarbaraDevor,ElMasri,WeeldreyerKeller,WiederholdKeller,WiederholdSecondAuthor(contd.)Ceri,NavatheVianu,GrumbachSilbershatz,LevyJagadish,GehaniSu,Chen,ChuSu,Lee
Collmeyer,ShemerSu,Lipovski,CopelandYu,DiasLee,Cheng
Griffeth,Fischer
Table1:-clustersonthepairoffirstauthorandsecondauthorattributes
6.1
SyntheticDatasets
6.2
RealDatasets
Wefirstpresentourexperimentsonsyntheticdatasets.Thetestdatasetsweregeneratedusingthedatageneratordevel-opedbyGibsonetal.[GKR98]toevaluateSTIRR.(SeeSec-tion3.2foradescriptionofthedatagenerator.)Wesetthenumberoftuplestomillion,thenumberofattributestoandthenumberofattributevaluesforeachattributeto.Inalldatasets,thecluster-projectionsoneachattributewere
and(asshowninFigure1).Wefixthevalueofat,andthevalueofthedistinguishingnumberat.ForSTIRR,wefixedthenumberofiterationstobeassuggestedbyGibsonetal.[GKR98].
CACTUSdiscoveredtheclustersintheinputdatasetsshowninFigures1,2,and3.
Figure10plotstherunningtimewhileincreasingthenumberoftuplesfromtomillion.Figure11plotstherunningtimewhileincreasingthenumberofattributesfromto.Figure12plotstherunningtimewhileincreasingthenumberofattributevaluesfromtowhilefixingthenumberofattributesat.Whilevaryingthenumberofattributevalues,weassumedthatuntilattributevalues,theinter-attributesummarieswouldfitintomainmemory;foralargernumberofattributevalueswetookthemulti-layeredapproachdescribedinSection5.Inallcases,CACTUSistotimesfasterthanSTIRR.
Inthissection,wediscussanapplicationofCACTUStoacombinationoftwosetsofbibliographicentries.TheresultsfromtheapplicationshowthatCACTUSfindsintuitivelymeaningfulclustersfromthedatasetthussupportingourdefinitionofacluster.Thefirstsetconsistsofbibliographicentriesforarti-clesrelatedtodatabaseresearch[Wie]andthesecondsetcon-sistsofbibliographicentriesforarticlesrelatedtoThe-oreticalComputerScienceandrelatedareas[Sei].Foreacharticle,weusethefollowingfourattributes:thefirstauthor,thesecondauthor,theconferenceorthejournalofpublication,andtheyear.Ifanarticleissingly-authoredthentheauthor’snameisrepeatedinthesecondauthorattributeaswell.Thesizesofthefirstauthor,thesecondauthor,theconference,andtheyearattributedomainsforthedatabase-related,thetheory-related,andthecombinedsetsare,
,andrespec-tively.WecombinedthetwosetstogethertocheckifCAC-TUSisabletoidentifythedifferencesandtheoverlapbe-tweenthetwocommunities.Notethatforthesedomains,
someoftheinter-attributesummariesandtheintra-attributesummariesespeciallythoseinvolvingthefirstauthorandthesecondauthordimensionsdonotfitinmainmemory.How-ever,wechoosethisparticulardatasetbecauseitiseasiertoverifythevalidityoftheresultingclusters(thanforsomeother
ACMSIGMODManagement,VLDB,ACMTODS,ICDE,ACMSIGMODRecord
ACMTG,COMPGEOM,FOCS,GEOMETRY,ICALP,IPL,JCSS,JSCOMP,LIBTR,SICOMP,TCS,TRPODS,ALGORITHMICA,FOCS,ICALP,INFCTRL,IPL,JCSS,SCT,SICOMP,STOC
Table2:Cluster-projectionsonConferencew.r.t.theFirstAuthor
publiclyavailabledatasets,e.g.,theMUSHROOMdatasetfromtheUCIMachineLearningrepository).
Table5.1showssomeofthe-clustersonthefirstau-thorandthesecondauthorattributepair.Weonlypresentthedatabase-relatedcluster-projectionstoillustratethatCAC-TUSidentifiesthedifferencesbetweenthetwocommunities.Weverifiedthevalidityofeachcluster-projectionbyqueryingontheDatabaseSystemsandLogicProgrammingbibliogra-phyatthewebsitemaintainedbyMichaelLey[Ley].Sim-ilarcluster-projectionsidentifyinggroupsoftheory-relatedresearchersaswellasgroupsthatcontributetobothfieldsalsoexist.Duetospaceconstraints,weshowsomecluster-projectionscorrespondingtothelattertwotypesinthefullpaper[GGR99].
Table2showssomeofthecluster-projectionsonthecon-ferenceattributecomputedwithrespecttothefirstauthorattribute.Thefirstrowconsistsexclusivelyofagroupofdatabase-relatedconferences,thesecondconsistsexclusivelyoftheory-relatedconferences,andthethirdamixtureofbothreflectingaconsiderableoverlapbetweenthetwocommuni-ties.
7ConclusionsandFutureWork
Inthispaper,weformalizedthedefinitionofaclusterwhenthedataconsistsofcategoricalattributes,andthenintroducedafastsummarization-basedalgorithmCACTUSfordiscover-ingsuchclustersincategoricaldata.Wethenevaluatedouralgorithmagainstbothsyntheticandrealdatasets.
Infuture,weintendtoextendCACTUSinthefollowingthreedirections.First,weintendtorelaxtheclusterdefinitionbyallowingsetsofattributevaluesoneachattributewhichare“almost”stronglyconnectedtoeachother.Second,motivatedbytheobservationthatinter-attributesummariescanbein-crementallymaintainedunderadditionanddeletionoftuples,weintendtoderiveanincrementalclusteringalgorithmfromCACTUS.Third,weintendto“rank”theclustersbasedonameasureofinterestingness,say,somefunctionofthesupportofacluster.
Acknowledgements:WethankPrabhakarRaghavanforsendingusthebibliographicdata.
References
[AGGR98]RakeshAgrawal,JohannesGehrke,DimitriosGunopu-los,andPrabhakarRaghavan.Automaticsubspaceclus-teringofhighdimensionaldatafordatamining.InPro-ceedingsoftheACMSIGMODConferenceonManage-mentofData,1998.
[AMS96]RakeshAgrawal,HeikkiMannila,Ramakrishnan
Srikant,HannuToivonen,andA.InkeriVerkamo.FastDiscoveryofAssociationRules.InUsamaM.Fayyad,GregoryPiatetsky-Shapiro,PadhraicSmyth,andRa-masamyUthurusamy,editors,AdvancesinKnowledgeDiscoveryandDataMining,chapter12,pages307–328.AAAI/MITPress,1996.
[BD76]
PeterJ.BickelandKjellA.Doksum.MathematicalStatistics:BasicIdeasandSelectedTopics.PrenticeHall,1976.
[CBM98]E.KeoghC.BlakeandC.J.Merz.UCIrepositoryofmachinelearningdatabases,1998.
[Cou95]TransactionProcessingPerformanceCouncil,May1995.http://www.tpc.org.
[CS96]
P.CheesemanandJ.Stutz.Bayesianclassification(au-toclass):Theoryandresults.InU.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors,Ad-vancesinKnowledgeDiscoveryandDataMining,pages153–180.MITPress,1996.
[DER86]I.S.Duff,A.M.Erisman,andJ.K.Reid.DirectMethodsforSparseMatrices.OxfordUniversityPress,1986.[DLR77]
A.P.Dempster,N.M.Laird,andD.B.Rubin.Maximumlikelihoodfromincompletedataviatheemalgorithm.JournaloftheRoyalStatisticalSociety,1977.
[GGR99]
VenkateshGanti,JohannesGehrke,andRaghuRamakr-ishnan.Cactus–clusteringcategoricaldatausingsum-maries.http://www.cs.wisc.edu/vganti/demon/cactus-full.ps,March1999.
[GJ79]
M.R.GareyandD.S.Johnson.Computersandintractability—AguidetothetheoryofNP-completeness.Freeman;BellLab,MurrayHillNJ,1979.
[GKR98]
DavidGibson,JonKleinberg,andPrabhakarRaghavan.Clusteringcategoricaldata:Anapproachbasedondy-namicalsystems.InProceedingsofthe24thInterna-tionalConferenceonVeryLargeDatabases,pages311–323,NewYorkCity,NewYork,August24-271998.[GRS99]
SudiptoGuha,RajeevRastogi,andKyuseokShim.Rock:Arobustclusteringalgorithmforcategoricalattributes.InProceedingsoftheIEEEInternationalConferenceonDataEngineering,Sydney,March1999.[Ley]MichaelLey.Computersciencebibliography.http://www.informatik.uni-trier.de/ley/db/index.html.[Ram97]RaghuRamakrishnan.DatabaseManagementSystems.McGrawHill,1997.
[Sei]J.Seiferas.Bibliographyontheoryofcomputerscience.http://liinwww.ira.uka.de/bibliography/Theory/Seiferas.[Wie]
G.Wiederhold.Bibliographyondatabasesystems.
http://liinwww.ira.uka.de/bibliography/Database/Wiederhold.
因篇幅问题不能全部显示,请点此查看更多更全内容