您的当前位置：首页 Abstract CACTUS–Clustering Categorical Data Using Summaries

Abstract CACTUS–Clustering Categorical Data Using Summaries

来源：爱站旅游

CACTUS–ClusteringCategoricalDataUsingSummaries

VenkateshGantiJohannesGehrkeRaghuRamakrishnanDepartmentofComputerSciences,UniversityofWisconsin-Madison

vganti,johannes,raghu@cs.wisc.edu

Abstract

Clusteringisanimportantdataminingproblem.Mostoftheearlierworkonclusteringfocussedonnumericattributeswhichhaveanaturalorderingontheirattributevalues.Recently,clusteringdatawithcategoricalattributes,whoseattributevaluesdonothaveanaturalordering,hasreceivedsomeattention.However,previousalgorithmsdonotgiveaformaldescriptionoftheclusterstheydiscoverandsomeofthemassumethattheuserpost-processestheoutputofthealgorithmtoidentifytheﬁnalclusters.

Inthispaper,weintroduceanovelformalizationofaclusterforcategoricalattributesbygeneralizingadeﬁnitionofaclusterfornumericalattributes.Wethendescribeaveryfastsummarization-basedalgorithmcalledCACTUSthatdiscoversexactlysuchclustersinthedata.CACTUShastwoimportantcharacteristics.First,thealgorithmrequiresonlytwoscansofthedataset,andhenceisveryfastandscalable.OurexperimentsonavarietyofdatasetsshowthatCACTUSoutperformspreviousworkbyafactorofto.Second,CACTUScanﬁndclustersinsubsetsofallattributesandcanthusperformasubspaceclusteringofthedata.Thisfeatureisimportantifclustersdonotspanallattributes,alikelyscenarioifthenumberofattributesisverylarge.Inathoroughexperimentalevaluation,westudytheperformanceofCACTUSonrealandsyntheticdatasets.

1Introduction

Clusteringisanimportantdataminingproblem.Thegoalofclustering,ingeneral,istodiscoverdenseandsparseregionsinadataset.Mostpreviousworkinclusteringfocussedonnumericaldatawhoseinherentgeometricpropertiescanbeexploitedtonaturallydeﬁnedistancefunctionsbetweenpoints.However,manydatasetsalsoconsistofcategorical

SupportedbyaMicrosoftGraduateFellowship.SupportedbyanIBMGraduateFellowship.

ThisresearchwassupportedbyGrant2053fromtheIBMcorporation.

attributesonwhichdistancefunctionsarenotnaturallydeﬁned.Recently,theproblemofclusteringcategoricaldatastartedreceivinginterest[GKR98,GRS99].

Asanexample,considertheMUSHROOMdatasetinthepopularUCIMachineLearningrepository[CBM98].Eachtupleinthedatasetdescribesasampleofgilledmushroomsusingtwentytwocategoricalattributes.Forinstance,thecapcolorattributecantakevaluesfromthedomainbrown,buff,cinnamon,gray,green,pink,purple,red,white,yellow.Itishardtoreasonthatonecoloris“like”or“unlike”anothercolorinawaysimilartorealnumbers.

Animportantcharacteristicofcategoricaldomainsisthattheytypicallyhaveasmallnumberofattributevalues.Forexample,thelargestdomainforacategoricalattributeofanydatasetintheUCIMachineLearningrepositoryconsistsofattributevalues(foranattributeofthePendigitsdataset).Categoricalattributeswithlargedomainsizestypicallydonotcontaininformationthatmaybeusefulforgroupingtuplesintoclasses.Forinstance,theCustomerIdattributeintheTPC-Ddatabasebenchmark[Cou95]mayconsistofmillionsofvalues;giventhatarecord(orasetofrecords)takesacertainCustomerIdvalue(orasetofvalues),wecannotinferanyinformationthatisusefulforclassifyingtherecords.Therefore,itisdifferentfromtheageorgeographicallocationattributeswhichcanbeusedtogroupcustomersbasedontheirageorlocationorboth.Typically,relationscontaintoattributes;hence,eventhoughthesizeofeachcategoricaldomainissmall,thecrossproductofalltheirdomainsandhencetherelationitselfcanbeverylarge.

Inthispaper,weintroduceafastsummarization-basedalgorithmcalledCACTUSforclusteringcategoricaldata.CACTUSexploitsthesmalldomainsizesofcategoricalat-tributes.ThecentralideainCACTUSisthatsummaryin-formationconstructedfromthedatasetissufﬁcientfordis-coveringwell-deﬁnedclusters.Thepropertiesthatthesum-maryinformationtypicallyﬁtsintomainmemory,andthatitcanbeconstructedefﬁcientlytypicallyinasinglescanofthedatasetresultinsigniﬁcantperformanceimprovements:

Attributeswhosedomainistotallyorderedarecallednumeric,whereasattributeswhosedomainisnotorderedarecalledcategorical.

CAtegoricalClusTeringUsingSummaries

afactoroftotimesoveroneofthepreviousalgorithms.Ourmaincontributionsinthispaperare:

1.Weformalizetheconceptofaclusterovercategoricalattributes(Section3).2.Weintroduceafastsummarization-basedalgorithmCAC-TUSforclusteringcategoricaldata(Section4).3.WethenextendCACTUStodiscoverclustersinsub-spaces,especiallyusefulwhenthedataconsistsofalargenumberofattributes(Section5).4.Inanextensiveexperimentalstudy,weevaluateCAC-TUSandcompareitwithearlierworkonsyntheticandrealdatasets(Section6).

2RelatedWork

Inthissection,wediscusspreviousworkonclusteringcate-goricaldata.TheEM(Expectation-Maximization)algorithmisapopulariterativeclusteringtechnique[DLR77,CS96].Startingwithaninitialclusteringmodel(amixturemodel)forthedata,ititerativelyreﬁnesthemodeltoﬁtthedatabetter.Afteranindeterminatenumberofiterations,itterminatesatalocallyoptimalsolution.TheEMalgorithmassumesthattheentiredataﬁtsintomainmemoryandhenceisnotscalable.WenowdiscusstwopreviousscalablealgorithmsSTIRRandROCKforclusteringcategoricaldata.

Gibsonetal.introduceSTIRR,aniterativealgorithmbasedonnon-lineardynamicalsystems[GKR98].Theyrepresenteachattributevalueasaweightedvertexinagraph.Multiple

copies

,calledbasins,ofthissetofweightedverticesaremaintained;theweightsonanygivenvertexmaydifferacrossbasins.iscalledtheprincipalbasin;

arecallednon-principalbasins.Startingwitha

setofweightsonallvertices(inallbasins),thesystemis“iterated”untilaﬁxedpointisreached.Gibsonetal.arguethatwhentheﬁxedpointisreached,theweightsinoneormoreofthebasinsisolatetwogroupsofattributevaluesoneachattribute:theﬁrstwithlargepositiveweightsandthesecondwithsmallnegativeweights,andthatthesegroupscorrespondintuitivelytoprojectionsofclustersontheattribute.However,theautomaticidentiﬁcationofsuchsetsofcloselyrelatedattributevaluesfromtheirweightsrequiresanon-trivialpost-processingstep;suchapost-processingstepwasnotaddressedintheirwork.Moreover,thepost-processingstepwillalsodeterminewhat“clusters”areoutput.Also,asweshowinSection3.2,certainclassesofclustersarenotdiscoveredbySTIRR.

Guhaetal.introduceROCK,anadaptationofanagglom-erativehierarchicalclusteringalgorithm,whichheuristicallyoptimizesacriterionfunctiondeﬁnedintermsofthenumberof“links”betweentuples[GRS99].Informally,thenumberoflinksbetweentwotuplesisthenumberofcommonneighbors

Givenasimilarityfunction,twotuplesinthedatasetaresaidtobeneighborsifthesimilaritybetweenthemisgreaterthanacertainthreshold.

theyhaveinthedataset.Startingwitheachtupleinitsowncluster,theyrepeatedlymergethetwoclosestclusterstilltherequirednumber(say,)ofclustersremain.Sincethecom-plexityofthealgorithmiscubicinthenumberoftuplesinthedataset,theyclusterasamplerandomlydrawnfromthedataset,andthenpartitiontheentiredatasetbasedontheclus-tersfromthesample.Beyondthatthesetofall“clusters”togethermayoptimizeacriterionfunction,thesetoftuplesineachindividualclusterisnotcharacterized.

3Deﬁnitions

Inthissection,weformallydeﬁnetheconceptofaclusterovercategoricalattributes,andotherconceptsusedintheremainderofthepaper.WethencomparetheclassofclustersallowedbyourdeﬁnitionwiththosediscoveredbySTIRR.3.1ClusterDeﬁnition

Intuitively,aclusteronasetofnumericattributesidentiﬁesa“denseregion”intheattributespace.Thatis,theregionconsistsofasigniﬁcantlylargernumberoftuplesthanexpected.Wegeneralizethisintuitivenotionfortheclassofhyper-rectangularclusterstothecategoricaldomain.Asanillustrativeexample,theregion

maycorrespondtoaclusterinthe3-dspacespannedbythreenumericattributes.Ingeneral,theclassofrectangularregionscanbeexpressedasthecrossproductofintervals.Sincedomainsofcategoricalattributesarenotordered,theconceptofanintervaldoesnotexist.However,astraightforwardgeneralizationoftheconceptofanintervaltothecategoricaldomainisasetofattributevalues.Consequently,thegeneralizationofrectangularregionsinthenumericdomaintocategoricaldomainisthecrossproductofsetsofattributevalues.Wecallsuchregionsintervalregions.

Intuitively,aclusterconsistsofasigniﬁcantlylargernumberoftuplesthanthenumberexpectedifallattributeswereindependent.Inaddition,aclusteralsoextendstoaslargearegionaspossible.Wenowformalizethisnotionforcategoricaldomainsbyﬁrstdeﬁningthenotionofatuplebelongingtoaregion,andthenthesupportofaregion,whichisthenumberoftuplesinthedatasetthatbelongtotheregion.

Deﬁnition3.1Letbeasetofcategoricalat-tributeswithdomains,respectively.Letthedatasetbeasetoftupleswhereeachtuple:

.Wecallanintervalregioniffor

all,.Letand,.Thesupportoftheattributevaluepairwithrespecttoisdeﬁnedasfollows:

and

Classesofclustersthatcorrespondtoarbitrarilyshapedregionsinthenumericdomaincannotbegeneralizedascleanlytothecategoricaldomainbecausethecategoricalattributesdonothaveanaturalorderingimposedontheirdomains.Therefore,weonlyconsidertheclassofhyper-rectangularregions.

Atupleissaidtobelongtothe

regionifforall

,.Thesupportofisthenumberoftuplesinthatbelongto.

Ifallattributesareindependentandtheat-tributevaluesineachattributeareequallylikely(henceforthreferredtoastheattribute-independenceassumption)thentheexpectedsupportofaregionis

.Asbefore,theexpectedsupportofis.Sincethedatasetisunder-stoodfromthecontext,wewriteinsteadofinsteadof.Finally,wenotethatthe,andat-tributeindependenceassumptioncanbemodiﬁedtotakeany

priorinformationintoaccount;e.g.,themarginalprobabilitiesofattributevalues.Intuitively,capturestheco-occurence,andhencethesimilarity,ofattributevaluesand.Valuesandaresaidtobestronglyconnectediftheirco-occurrence()issigniﬁcantlyhigher(bysomefactor)thanthevalueexpectedundertheattribute-independence.Wenowdeﬁnetoformalizethisintuition,andthengiveaformaldeﬁnitionofacluster.

Deﬁnition3.2Let,,and.Theattributevaluesandarestronglyconnectedwithrespect

toif

.Thefunctionis

deﬁnedasfollows:

ifandarestronglyconnectedotherwise

Letand,,betwosetsofattributevalues.Anelementisstronglyconnectedwithif,forall,andarestronglyconnected.andaresaidtobestronglyconnectedifeachisstronglyconnectedwithandeachisstronglyconnectedwith.

Deﬁnition3.3For,let,,

and.Thenifthefollowingthreeconditionsisaareclustersatisﬁed.

over

(1)Forall,andarestronglyconnected.(2)Forall,,thereexistsnosuchthatforall,andarestronglyconnected.(3)Thesupportofisatleasttimestheexpectedsupportofundertheattribute-independenceassumption.Wecallthecluster-projectionofon.iscalledasub-clusterifitsatisﬁesconditions(1)and(3).Aclusteroverasubsetofallattributesiscalledasubspaceclusteron;iftheniscalleda-cluster.

Becauseadeviationofortimestheexpectedvalueisusuallyconsideredsigniﬁcant[BD76],typicalvaluesofarebetweenand.

Wenowextendournotionofsimilaritytoattributevalue

pairsonthesameattribute.Let

and.Ifandarestronglyconnectedthenare“similar”toeachotherwithrespectto.Thelevelofsimilarityisthenumberofsuchdistinctattributevalues

.Wenowformalizethisintuition.

Deﬁnition3.4Let.Thesimilaritybetweenandwithrespecttoisdeﬁnedasfollows.

and

Below,wedeﬁnethesummaryinformationwhichweneedlatertodescribetheCACTUSalgorithm.Thesummaryin-formationisoftwotypes:(1)inter-attributesummariesand(2)intra-attributesummaries.Theinter-attributesummariesconsistofallstronglyconnectedattributevaluepairswhereeachpairhasattributevaluesfromdifferentattributes;theintra-attributesummariesconsistofsimilaritiesbetweenat-tributevaluesofthesameattribute.

Deﬁnition3.5Letbeasetofcategoricalat-tributeswithdomainsrespectively,andletbeadataset.Theinter-attributesummaryisdeﬁnedas:

where

and

Theintra-attributesummary

isdeﬁnedas:andwhere

and

3.2Discussion

WenowcomparetheclassofclustersallowedbyourdeﬁnitionwiththeclustersdiscoveredbySTIRR.Forthecomparison,wegeneratetestdatausingthedatageneratordevelopedbyGibsonetal.forevaluatingSTIRR[GKR98].WeconsiderthreedatasetsshowninFigures1,2,and3.Eachdatasetconsistsoftuples.DS1andDS2havetwoattributes,DS3hasthreeattributeswhereeachattributeconsistsofattributevalues.Thesetuplesaredistributedoverallattributevaluesoneachattributeaccordingtotheattribute-independenceassumption.Wecontrolthelocationandthesizeofclustersineachdatasetbydistributinganadditionalnumberoftuples(ofthetotalnumberinthedataset)inregionsdesignatedtobeclustersthusincreasingtheirsupportsabovetheexpectedvalueundertheattribute-independenceassumption.InFigures1,2,and3,thecluster-projectionofeachclusterisshownwithinanellipse.Theboundariesofthecluster-projections(ellipses)ofaclusterareconnectedbylinesofthesametype(e.g.,solid,dashedetc.).

0000000910109109109107910179101920991920991819192019209919209920999999Figure1:DS1Figure2:DS2

0Figure3:DS3

0000009109109109107910910192099192099171819209999192099192099192099Figure4:DS1:STIRR’sO/PFigure5:DS2:STIRR’sO/PFigure6:DS3:STIRR’sO/P

WeranSTIRRonthedatasetsshowninFigures1,2,and3,andmanuallyselectedthebasinthatassignspositiveandnegativeweightsrespectivelytoattributevaluesindifferentcluster-projections.Toidentifytheclusterprojections,weobservedtheweightsallocatedbySTIRRandisolatedtwogroupssuchthattheweightsineachgrouphavelargemagnitudeandareclosetoeachother.Thecluster-projectionsidentiﬁedbySTIRRareshowninFigures4,5,and6.

STIRRrecognizedthecluster-projectionsforDS1ontheﬁrstnon-principalbasin()foreveryattribute(asshowninFigure4).WhenrunonthedatasetDS2,theﬁrstnon-principalbasin()onidentiﬁesthetwogroups:

and(asshowninFigure5).The

secondnon-principalbasin()onidentiﬁesthefollowingtwogroups:and.Thus,nobasinidentiﬁestheoverlapbetweenthecluster-projections.Itmaybepossibletoidentifysuchoverlapsthroughanon-trivialpostprocessingstep.However,itisnotclearhowmanybasinsarerequiredandhowtorecognizethatcluster-projectionsoverlapfromtheweightsonattributevalues.Webelievethatanysuchpost-processingstepitselfwillbesimilartotheCACTUSalgorithm.TheresultofrunningSTIRRonthedatasetDS3isshowninFigure6.STIRRmergedthetwocluster-projectionsonthesecondattribute,possiblybecauseoneofthecluster-projectionsparticipatesinmorethanonecluster.

Fromtheseexperiments,weobservethatSTIRRfailstodiscoverthefollowingclassesofclusters:(1)clustersconsistingofoverlappingcluster-projectionsonanyattribute,(2)clusterswheretwoormoreclusterssharethesamecluster-projection.However,intuitively,thesetwoclassesofclusters

a1a2aa34b1b2b3b4c1c2cc34ABCw.r.t.

:2:2:2:2:2:2

w.r.t.

:2:2:2

w.r.t.

:3:2

Figure7:

Figure8:

shouldbevalidclassesofclusters,andourclusterdeﬁnitionincludestheseclasses.CACTUScorrectlydiscoversalltheimplantedclustersfromthedatasetsDS1,DS2,andDS3.Thus,ourdeﬁnitionofaclusterandhenceCACTUS,whichdiscoversallclustersallowedbyourdeﬁnition,seemstoidentifyabroaderclassofclustersthanthatdiscoveredbySTIRR.SinceitisnotpossibletocharacterizeclustersdiscoveredbySTIRR,wecouldnotconstructanyexampledatasetsfromwhichCACTUSdoesnotretrievetheexpectedclustersandSTIRRdoes.However,itispossiblethatsuchtypesofclustersexist.

4CACTUS

Inthissection,wedescribeourthree-phaseclusteringalgo-rithmCACTUS.ThecentralideabehindCACTUSisthatasummaryoftheentiredatasetissufﬁcienttocomputeaset

of“candidate”clusterswhichcanthenbevalidatedtodeter-minetheactualsetofclusters.CACTUSconsistsofthreephases:summarization,clustering,andvalidation.Inthesummarizationphase,wecomputethesummaryinformationfromthedataset.Intheclusteringphase,weusethesum-maryinformationtodiscoverasetofcandidateclusters.Inthevalidationphase,wedeterminetheactualsetofclus-tersfromthesetofcandidateclusters.Weintroduceahy-potheticalexamplewhichweusethroughoutthepapertoillustratethesuccessivephasesinthealgorithm.Consideradatasetwiththreeattributes,andwithdomains

,,and,respec-tively.Letthestronglyconnectedattributevaluepairsbeas

showninFigures7and8.

4.1SummarizationPhase

Inthissection,wedescribethesummarizationphaseofCACTUS.Weshowhowtoefﬁcientlycomputetheinter-attributeandtheintra-attributesummaries,andthendescribetheresourcerequirementsformaintainingthesesummaries.Categoricalattributesusuallyhavesmalldomains.Typicalcategoricalattributedomainsconsideredforclusteringconsistoflessthanahundredor,rarely,athousandattributevalues.Animportantimplicationofthecompactnessofcategoricaldomainsisthattheinter-attributesummaryforanypairofattributesandﬁtsintomainmemorybecausethenumberofallpossibleattributevaluepairsfromandequals.Fortherestofthissection,weassumethattheinter-attributesummaryofanypairofattributesﬁtseasilyintomainmemory.(Wewillgiveanexamplelatertosupportthisassumption,andtoshowthattypicallyinter-attributesummariesformanypairsofattributestogetherﬁtintomainmemory.)However,forthesakeofcompleteness,weextendourtechniquesinSection5tohandlecaseswherethistraitisviolated.Thesameargumentholdsfortheintra-attributesummariesaswell.

4.1.1Inter-attributeSummaries

Wenowdiscussthecomputationoftheinter-attributesum-maries.Considerthecomputationof,.Weinitializeacountertozeroforeachpairofattributevalues

,andstartscanningthedataset.Foreachtuple,weincrementthecounterforthepair.Afterthescanofiscompleted,wecomputebyset-tingtozeroallcounterswhosevalueislessthanthethreshold

.Thus,countsofonlythestronglycon-nectedpairsareretained.Thenumberofstronglyconnectedpairsisusuallymuchsmallerthan.Therefore,thesetofstronglyconnectedpairscanbemaintainedinspecial-izeddatastructuresdesignedforsparsematrices[DER86].Wenowpresentahypotheticalexampletoillustratetheresourcerequirementsofthesimplestrategydescribedabove.Consideradatasetwithattributeseachconsistingof

Inourcurrentimplementation,wemaintainthecountsofstrongly

connectedpairsinanarrayanddonotoptimizeforspace.

attributevalues.SupposewehaveMBofmainmemory(easilyavailableoncurrentdesktopsystems).Assumingthateachcounterrequiresbyteswecanmaintaincountersfor

attributepairssimultaneously.With

attributes,wehavetoevaluateattributepairs.Therefore,wecancomputeallinter-attributesummariestogetherinjustonescanofthedataset.Thecomputationalandspacerequirementsherearesimilartothatofobtainingcountsofpairsofitemswhilecomputingfrequentitemsets[AMS96].Quiteoften,asinglescanissufﬁcientforcomputing.Insomecases,wemayneedtoscanmultipletimeseachscancomputingforadifferentsetofpairs.Thecomputationoftheinter-attributesummariesisCPU-intensive,especiallywhenthenumberofattributesishigh,becauseforeachtupleinthedataset,wehavetoincrement

counters.Evenifwerequiremultiplescansofthe

dataset,theI/OtimeforscanningthedatasetgoesupbutthetotalCPUtimeforincrementingthecountersremainsthesame.SincetheCPUtimedominatestheoverallsummary-constructiontime,therelativeincreaseduetomultiplescansisnotsigniﬁcant.Forinstance,consideradatasetofmilliontuplesdeﬁnedonattributes,eachconsistingofattributevalues.Experimentally,wefoundthatthetotaltimeforcomputingtheinter-attributesummariesofthedatasetwithmilliontuplesisseconds,whereasascanofthedatasettakesjustseconds.Supposewepartitionallthepairsofattributesintothreegroupsconsistingof,,and

pairsrespectively.Thecomputationoftheinter-attributesummariesofattributepairsineachgrouprequiresascanofthedataset.Thetotalcomputationtimewillbearound

seconds,whichisonlyslightlyhigherthancomputingthesummaryinonescan.

4.1.2Intra-attributeSummaries

Inthissection,wedescribethecomputationoftheintra-attributesummaries.Weagainexploitthecharacteristicthatcategoricaldomainsareverysmallandthusassumethattheintra-attributesummaryofanyattributeﬁtsinmainmemory.OurprocedureforcomputingreﬂectstheevaluationofthefollowingSQLquery:

SelectFromasT1(A,B),asT2(A,B)WhereT1.AT2.AandT1.B=T2.BGroupbyT1.A,T2.AHaving;

Theabovequeryjoinswithitselftocomputethesetofattributevaluepairsofstronglyconnectedtoeachotherwithrespectto.Sinceﬁtsinmainmemorytheself-joinandhencethecomputationofisveryfast.Wewillobserveinthenextsectionthat,atanystageofouralgorithm,weonlyrequireforaparticularpairofattributesand

Foranexpositionofjoinprocessing,seeanystandardtextbookondatabasesystems,e.g.,[Ram97].

.Therefore,wecompute,,foreachpairwheneveritisrequired.

ConsidertheexampleshowninFigure7.(Weusethe

notation

todenotetheinter-attributesummarybetweenattributesand.)Theinter-attributesummaries,

,andcorrespondtotheedgesbetweenattributevaluesintheﬁgure.Theintra-attributesummaries,

,areshowninFigure8.

4.2ClusteringPhase

Inthissection,wedescribethetwo-stepclusteringphaseofCACTUSthatusestheattributesummariestocomputecandidateclustersinthedata.Intheﬁrststep,weanalyzeeachattributetocomputeallcluster-projectionsonit.Inthesecondstep,wesynthesize,inalevel-wisemanner,candidateclustersonsetsofattributesfromthecluster-projectionsonindividualattributes.Thatis,wedeterminecandidateclustersonapairofattributes,thenextendthepairtoasetofthreeattributes,andsoon.Wenowdescribeeachstepindetail.4.2.1ComputingCluster-ProjectionsonAttributesLetbethesetofattributesandbetheirdomains.Thecentralideaforcomputingallcluster-projectionsonanattributeisthataclusteroverthesetofattributesinducesasub-clusteroveranyattributepair,.Inaddition,thecluster-projectiononoftheclusteristheintersectionofthecluster-projectionsonof-clustersoverattributepairsonthe,(theattribute.Forexample,cluster-projectioninFigurethecluster-projection

on7istheoftheintersection

of-cluster

ofthe-cluster)and(thecluster-projection).Weformalizeon

theideainthefollowinglemma.

Lemma4.1Letbeaclusteronthesetof

attributes.Then,(1)Forall,

isasub-cluster

overthepairofattributes.

(2)Thereexistsasetand

isa2-cluster

oversuchthat

Lemma4.1motivatesthefollowingtwo-stepapproach.Intheﬁrstpairwisecluster-projectionstep,weclustereachat-tributewithrespecttoeveryotherattribute,toﬁndallcluster-projectionsonof-clustersover.Inthesecondintersectionstep,wecomputeallthecluster-projectionsonofclustersoverbyinter-sectingsetsofcluster-projectionsfrom-clusterscomputedintheﬁrststep.However,theproblemofcomputingcluster-projectionsof-clustersinthepairwisecluster-projectionstepisatleastashardastheNP-completecliqueproblem[GJ79].

Acliqueinisasetofverticesthatareconnectedtoeachotherbyedgeswithnon-zeroweights.Givenagraphandaconstant,thecliqueproblemdeterminesifconsistsofacliqueofsizeatleast.

Bb4b3b2b1a1a2a3a4AFigure9:Extending

w.r.t.

Thefollowinglemmaformalizesthecomputationalcomplex-ity.Theproofisgiveninthefullpaper[GGR99].

Lemma4.2Letandbetwoattributes.Theproblemofcomputingallcluster-projectionsonof-clustersover

isNP-complete.

Toreducethecomputationalcomplexityofthecluster-projectionproblem,weexploitthefollowingpropertywhich,webelieve,isusuallyexhibitedbyclustersinthecategorical

domain.Ifacluster-projection

onofone(ormore)cluster(s)islargerthanaﬁxedpositiveinteger,calledthedistinguishingnumber(denoted),thenitconsistsofasmallidentifyingsetwhichwecallthedistinguishingsetofattributevaluessuchthattheywillnottogetherbecontainedinanyothercluster-projectionon.Thus,thedistinguishingsetdistinguishesfromothercluster-projectionson.Notethatapropersubsetofthedistinguishingsetmaystillbelongtoanothercluster-projection,andthattwodistinctclustersmayshareanidenticalcluster-projection(asinFigure1).

Webelievethatthedistinguishingsubsetassumptionholdsinalmostallcases.Evenforthemostrestrictiveversion,whichoccurswhenthedistinguishingnumberisandallcluster-projectionsofthesetofclustersaredistinct,theas-sumptiononlyrequiresthateachclusterconsistofasetofattributevaluesoneoneachattributethatdoesnotbelongtoanyothercluster.FortheexampleinFigure7,thesets

oridentifythecluster-projectionontheattribute.Wenowformallystatetheassumption.

DistinguishingSubsetAssumption:Letandeachofsizegreaterthanbetwodistinctcluster-projectionsontheattribute.Thenthereexisttwosetsandsuchthat

and

Wecall

thedistinguishingnumber.

PairwiseCluster-ProjectionsWecomputecluster-projectionsonof-clustersovertheattributepairintwosteps.Intheﬁrststep,weﬁndallpossibledistinguishingsets(ofsizelessthanorequalto)on.Inthesecondstep,weextendwithrespecttosome

ofthesedistinguishingsetstocomputecluster-projectionson.Henceforth,wewrite“cluster-projectionon”insteadofa“cluster-projectiononofa-clusterover.”DistinguishingSetComputation:Intheﬁrststep,werelyonthefollowingtwopropertiestoﬁndallpossibledistinguishingsetson.(1)Allpairsofattributevaluesinadistinguishingsetarestronglyconnected;thatisthedistinguishingsetformsaclique.(2)Anysubsetofadistinguishingsetisalsoaclique(monotonicityproperty).Thesetwopropertiesallowalevel-wisecliquegenerationsimilartothecandidategenerationinapriori[AMS96].Thatis,weﬁrstcomputeallcliquesofsize,thenusethemtocomputecliquesofsize,andsoonuntilwecomputeallcliquesofsizelessthanorequalto.Letdenotethesetofallcliquesofsizeequalto.Wegiveaninductivedescriptionoftheproceduretogeneratethe

set

.Thebasecasewhenconsistsofallpairsofstronglyconnectedattributevaluesin.Thesepairscaneasilybefoundfrom.Thesetiscomputedfromtheset()by“joining”withitself.Thejoinisthesub-setjoinusedinthecandidategenerationstepofthefrequentitemsetcomputationintheapriorialgorithm[AMS96].Wealsoremoveallthecandidatesinthatcontainaproper-subsetnotin(alasubsetpruninginapriori).

ExtensionOperation:Inthesecondstep,we“extend”someofthecandidatedistinguishingsetscomputedintheﬁrststeptocomputecluster-projectionsonof-clusterson

.Theintuitionbehindtheextensionoperationis

illustratedinFigure9.Supposewewanttoextendwithrespectto.Wecomputethesetofattributeonvaluesonstronglyconnectedwithwiththesetofallothervalues.Wethenextend

onthatis

stronglyconnectedwith.

Informally,theextensionofadistinguishingsetaddstoallattributevaluesinthatarestronglyconnected

withthesetofallattributevaluesin

thatisstronglyconnectedwith.Wenowintroducetheconceptsofsiblingset,subsetﬂag,andtheparticipationcounttoformallydescribetheextensionoperation.

Deﬁnition4.1Letandbetwoattributeswithdomains

and.Letbethesetofcluster-projectionsonof-clustersover.Letbeasetofcandidatedistinguishingsets,withrespectto,onattribute.Thesiblingsetofwithrespecttotheattributeisdeﬁnedasfollows:

forall

iscalledthesiblingstrengthofwithrespectto.

Thesubsetﬂagofwithrespecttoacollectionofsetsissaidtobeset(to)ifthereexistsasetsuchthat.Otherwise,thesubsetﬂagofisnotset.Theparticipationcountofwithrespecttoisthesumofthesiblingstrengthswithrespecttoofallsupersetsofin.

Informally,thesubsetﬂagandtheparticipationcountserve

thefollowingtwopurposes.First,acluster-projectionmay

consistofmorethanonedistinguishingsetin

.Therefore,ifweextendeachsetinaparticularcluster-projectionmaybegeneratedseveraltimes,onceforeachdistinguishingsetitcontains.Toavoidtherepeatedgenerationofthesamecluster-projection,weassociatewitheachdistinguishingsetasubsetﬂag.Thesubsetﬂagindicateswhetherthedistinguishingsetisasubsetofanexistingcluster-projectionin.Therefore,ifthesubsetﬂagissetthenthedistinguishingsetneednotbeextended.FortheexampleshowninFigure7,thedistinguishingsetscanbothbeextendedtothecluster-projectionandon.Second,thedistinguishingsubsetassumptionappliesonlytocluster-projectionsofsizegreaterthan.Therefore,acliqueofsizelessthanorequaltomaybeacluster-projectiononitsowneventhoughitmaybeasubsetofsomeothercluster-projection.Torecognizesuchsmallcluster-projections,weassociateaparticipationcountwitheachdistinguishingset.Iftheparticipationcountofadistinguishingsetwithrespecttoislessthanitssiblingstrengththenitmaybeasmallcluster-projection.

Algorithm4.1Extend()

/*Output:*//*Initialization*/

Resetthesubsetﬂagsandtheparticipationcountsofalldistinguishingsetsintozeroforeach

ifthesubsetﬂagofisnotsetthenExtendto

Setthesubsetﬂagsandincrementbythesiblingstrengthoftheparticipationcountsofallsubsetsofin.end/*if*/end/*for*/

Identifyandaddsmallcluster-projections(ofsize)to

ThepseudocodeforthecomputationofisshowninAl-gorithm4.1.Below,wedescribeeachstepindetail.

Initialization:Theﬁrsttwostepsinitializetheprocedure:weset,andthesubsetﬂagsandtheirparticipation

countsofalldistinguishingsetsin

tozero.Extending:Letbethesiblingsetofwithrespectto.Letbethesiblingsetofwithrespectto.Then,weextendtothecluster-projection.Addto.Prunesubsetsof:Supposewasextendedfrom.Then,bydeﬁnition,subsetsofcannotbethedistinguish-ingsetsofotherclusterprojectionson.Therefore,weset(to)allsubsetﬂagsofsubsetsof(including)in.Theparticipationcountofeachofthesesubsetsisalsoin-creasedbythesiblingstrengthof.

Identifyingsmallcluster-projections:Whileextendingdis-tinguishingsets,weonlychoosesetswhosesubsetﬂagsarenotset.Wecheckifeachunextendeddistinguishingsetwhosesubsetﬂagissetcanbeasmall(ofsizelessthan)cluster-projection.Iftheparticipationcountofequalsitssiblingstrength,thencannotbeacluster-projectiononitsown.Otherwise,maybeacluster-projection.Therefore,weaddto.

Notethatthecomputationofcluster-projectionsonrequiresonlytheinter-attributesummaryandtheintra-attributesummary.Sinceandﬁtintomainmemory,thecomputationisveryfast.

IntersectionofCluster-projections

Informally,theintersectionstepcomputesthesetofcluster-projectionsonofclustersoverbysucces-sivelyjoiningsetsofcluster-projectionsonof-clusters

overattributepairs

,.Fordescribingtheproce-dure,werequirethefollowingdeﬁnition.

Deﬁnition4.2Letandbetwocollectionsofsetsofattributevalueson.Wedeﬁnetheintersectionjoinbetweenandasfollows:

and

suchthat

and

Letbethesetofcluster-projectionsonwithrespectto,.Letif,else.Startingwith,theintersectionstepexecutesthefollowingoperationforall.

Theresultingsetisthesetofcluster-projectionsonofclustersover.Besidesbeingamain-memoryoperation,thenumberofcluster-projectionsonwithre-specttoanyotherattributeisusuallysmall;therefore,theintersectionstepisquitefast.

Furtheroptimizationsarepossibleoverthebasicstrategydescribedaboveforcomputingclusterprojections.Forinstance,wecancombinethecomputationofandthatof

because,foreachcluster-projectionin,wecomputeitssiblingsetwhichisacluster-projectionin.However,wedonotconsidersuchoptimizationsbecausetheclusteringphasetakesasmallfraction(lessthan)ofthetimetakenbythesummarizationphase.(OurexperimentsinSection6conﬁrmthisobservation.)

4.2.2Level-wiseSynthesisofClusters

Inthissection,wedescribethesynthesisofcandidateclus-tersfromthecluster-projectionsonindividualattributes(com-putedasdescribedinSection4.2).Thecentralideaisthata

clusteronasetofattributesinducesasub-clusteronanysub-setoftheattributes(monotonicityproperty).Themonotonic-itypropertyfollowsdirectlyfromthedeﬁnitionofacluster.

Wealsoexploitthefactthatwewanttocomputeclustersover

thesetofallattributes

.Informally,westartwithcluster-projectionsonandthenextendthemtoclus-tersover,thentoclustersover,andsoon.

Letbethesetofcluster-projectionsontheattribute,

.Letdenotethesetofcandidateclusters

deﬁnedoverthesetofattributes.Therefore,

.Wesuccessivelygeneratefromuntilisgeneratedorisemptyforsome.Thegenerationoffromproceedsasfollows.Set

.Foreachelement,we

attempttoaugmentwithaclusterprojectionontheattribute.Ifforall,isaweaugmentbecheckedbylookingsub-clusteronwhichcanup

toandaddto.

FortheexampleinFigure7,thecomputationofthesetofcandidateclustersproceedsasfollows.Westartwiththeseton.overWethethenattributeﬁndthepaircandidate,and-cluster

then

thecandidate-cluster.

over

4.3Validation

Wenowdescribeaproceduretocomputethesetofactualclustersfromthesetofcandidateclusters.Someofthecandidateclustersmaynothaveenoughsupportbecausesomeofthe-clustersthatcombinetoformacandidateclustermaybeduetodifferentsetsoftuples.Torecognizesuchfalsecandidates,wecheckifthesupportofeachcandidateclusterisgreaterthantherequiredthreshold.Onlyclusterswhosesupportonpassesthethresholdrequirementareretained.Aftersettingthesupportsofallcandidateclusterstozero,westartscanningthedataset.Foreachtuple,weincrementthesupportofthecandidateclustertowhichbelongs.(Becausethesetofclusterscorrespondtodisjointintervalregions,canbelongtoatmostonecluster.)Attheendofthescan,wedeleteallcandidateclusterswhosesupportinthedatasetislessthantherequiredthreshold:timestheexpectedsupportoftheclusterundertheattribute-independenceassumption.

Byconstruction,CACTUSdiscoversallclustersthatsatisfyourclusterdeﬁnition,andhencethefollowingtheoremholds.

Theorem4.1Giventhatthedistinguishingsubsetassump-tionholds,CACTUSﬁndsallandonlythoseclustersthatsat-isfyDeﬁnition3.3.

5Extensions

Inthissection,weextendCACTUStohandleunusuallylargeattributevaluedomainsaswellastoidentifyclustersinsubspaces.

5.1LargeAttributeValueDomains

Untilnow,weassumedthatthedomainsofcategoricalattributesaresuchthattheinter-attributesummaryofanypairofattributesandtheintra-attributesummaryofanyattributeﬁtsinmainmemory.Forthesakeofcompleteness,wemodifythesummarizationphaseofCACTUStohandlearbitrarilylargedomainsizes.

Recallthatthesummaryinformationonlyconsistsofstronglyconnectedpairsofattributevalues.Forlargedomainsizes,thenumberofstronglyconnectedattributevaluepairs(eitherfromthesameorfromdifferentattributes)relativetothenumberofallpossibleattributevaluepairsisverysmall.Weexploitthisobservationtocollapsesetsofattributevaluesoneachattributeintoasingleattributevaluethuscreatinganewdomainofsmallersize.Theintuitionisthatifapairofattributevaluesintheoriginaldomainarestronglyconnected,thenthecorrespondingpairoftransformedattributevaluesarealsostronglyconnected,providedthethresholdforstrongconnectivitybetweenattributevaluesinthetransformeddomainisthesameasthatfortheoriginaldomain.

Let

beanattributewithanunusuallylargedomain.Withoutlossofgenerality,letbethesetbethemaximumnumberofattributevalues.Let

per

attributesothattheinter-attributesummariesandtheintra-attributesummariesﬁtintomainmemory.Let.Weconstructofsizefrombymappingforagiven

tothevalue,theset.Formally,

ofattributevalues

where

Wesetthethresholdforthestrongconnectivityinvolvingattributevaluesinasifwasbeingused.Wethencomputetheinter-attributesummariesinvolvingusingthetransformeddomain.Foreachattributevaluethatparticipatesinastronglyconnectedpair(,),weexpandtothesetofallattributevalues

thatmapintoandform

thepairs.Wethenscanthedatasettocountthesupportsofallthesepairs,andselectthestronglyconnectedpairsamongthem;theyconstitutetheinter-attributesummary.

Thenumberofnewpairswhosesupportsaretobecountedislessthanorequaltowhererepresentsthenumberofstronglyconnectedpairsin.Ifthissetofpairsisstilllargerthanmainmemory,wecanrepeattheabovetransformationtrick.However,webelievethatsuchrepeatedapplicationwillberare.

5.2ClustersinSubspaces

CACTUSdoesnotdiscoverclustersinsubspacesforthefollowingreason.Theorderinwhichcluster-projectionsonindividualattributesarecombinedmaynotbetherightordertoﬁndasubspacecluster.Forinstance,ifspansthesubspacedeﬁnedbyasetofattributes

(when)thenthelevel-wisesynthesisdescribedinSection4.2.2willnotﬁnd.

Theextensiontoﬁndsubspaceclustersexploitsthemono-tonicitypropertyofsubspaceclusters.Thatis,aclusterinasubspaceinducesasubclusteronanysubsetof.Themonotonicitypropertyagainmotivatestheapriori-stylelevel-wisesynthesisofcandidateclustersfromthecluster-projectionsonindividualattributes.Thealgorithmdiffersintwowaysfromthealgorithmtoﬁndclustersoverallattributes.Theﬁrstdifferenceisthatwedonotrestrictthatacluster-projectiononanattributeshouldparticipatein-clusterwitheveryotherattribute.Theseconddifferenceisintheproce-dureforgeneratingthesetofcandidateclusters.Wenowdis-cussbothdifferences.

Weskiptheintersectionofcluster-projectionsoneach

attribute

withrespecttoeveryotherattribute()forthefollowingtworeasons.First,aclusterinsubspacemaynotinducea-clusteronapairofattributesnotin,andhencetheintersectionofcluster-projectionsonanattributeinwithrespecttoeveryotherattributemayreturnanemptyset.Second,theintersectionmaycausethelossofmaximality(condition(2)inDeﬁnition3.3)ofasubspacecluster.Forinstance,acluster-projectiononwithrespecttocorrespondstoa-clusterover

which,bydeﬁnition,isasubspacecluster;truncatingsuchacluster-projectionintheintersectionstepwillnolongeryieldamaximalclusteron.

Inthecandidategenerationalgorithm,weletdenotethesetofcandidateclustersdeﬁnedonanysetof-attributes(notnecessarily).Otherwise,thecandidategenerationproceedsexactlyasinSection4.2.2.Thereasonisthatasubspaceclusteronattributesmaynotalwaysbeintheﬁrstattributes.Foraclusterinasubspaceconsistingofattributes,theabovecandidategenerationprocedureexamines

),candidates.thenumberDependingofcandidateontheclustersvalueofcan(say,beprohibitivelylargerthanhigh.TheproblemofexaminingalargenumberofcandidateclustershasbeenaddressedbyAgrawaletal.[AGGR98].Theyusetheminimumdescriptionlengthprincipletoprunethenumberofcandidateclusters.Theirtechniquesapplydirectlyinourscenarioaswell.Therefore,wedonotaddressthisproblem;instead,wereferthereadertotheoriginalpaper[AGGR98].

6PerformanceEvaluation

Inthissection,weshowtheresultsofadetailedevaluationofthespeedandscalabilityofCACTUSonsyntheticandrealdatasets.WealsocomparedtheperformanceofCAC-TUSwiththeperformanceofSTIRR.OurresultsshowthatCACTUSisveryfastandscalable;itoutperformsSTIRRbyafactorbetweenand.

WeintendtocompareCACTUSandROCKafterourongoingimplemen-tationofROCKiscomplete.

Time vs. #tuples2000CACTUSSTIRR500040003000200010000Time vs. #AttributesCACTUSSTIRR250Time vs. #Attribute ValuesCACTUSSTIRR200Time (in seconds)Time (in seconds)1500Time (in seconds)15010001005005001234#tuples (in Millions)50102030#Attributes405000200400600#Attribute values8001000Figure10:Timevs.#TuplesFigure11:Timevs.#AttributesFigure12:Timevs.#Attr-values

FirstAuthor

Katz,Stonebraker,WongDeWitt,Hsiao

DeWitt,GhandeharizadehKanellakis,Beeri,VardiRamakrishnan,BeeriBancilhon,KiferAfrati,Cosmadakis

Alonso,Barbara,GarciaMolinaDevor,Elmasri

Barsolou,Keller,WiederholdBarsalou,Keller,Shalom

FirstAuthor(contd.)Ceri,Navathe

Abiteboul,GrumbachKorth,LevyAgrawal,GehaniChen,Hua,SuChen,Hua,Lam

Collmeyer,King,ShemerCopeland,Lipovski,SuCornell,Dan,Iyer,YuChang,Gupta

Fischer,Griffeth,LynchSecondAuthorKatz,WongDeWitt,David

DeWitt,GhandeharizadehAbiteboul,BeeriBeeri,SrivastavaRamakrishnan,Kim

Papadimitriou,CosmadakisGarciaMolina,BarbaraDevor,ElMasri,WeeldreyerKeller,WiederholdKeller,WiederholdSecondAuthor(contd.)Ceri,NavatheVianu,GrumbachSilbershatz,LevyJagadish,GehaniSu,Chen,ChuSu,Lee

Collmeyer,ShemerSu,Lipovski,CopelandYu,DiasLee,Cheng

Griffeth,Fischer

Table1:-clustersonthepairofﬁrstauthorandsecondauthorattributes

6.1

SyntheticDatasets

6.2

RealDatasets

Weﬁrstpresentourexperimentsonsyntheticdatasets.Thetestdatasetsweregeneratedusingthedatageneratordevel-opedbyGibsonetal.[GKR98]toevaluateSTIRR.(SeeSec-tion3.2foradescriptionofthedatagenerator.)Wesetthenumberoftuplestomillion,thenumberofattributestoandthenumberofattributevaluesforeachattributeto.Inalldatasets,thecluster-projectionsoneachattributewere

and(asshowninFigure1).Weﬁxthevalueofat,andthevalueofthedistinguishingnumberat.ForSTIRR,weﬁxedthenumberofiterationstobeassuggestedbyGibsonetal.[GKR98].

CACTUSdiscoveredtheclustersintheinputdatasetsshowninFigures1,2,and3.

Figure10plotstherunningtimewhileincreasingthenumberoftuplesfromtomillion.Figure11plotstherunningtimewhileincreasingthenumberofattributesfromto.Figure12plotstherunningtimewhileincreasingthenumberofattributevaluesfromtowhileﬁxingthenumberofattributesat.Whilevaryingthenumberofattributevalues,weassumedthatuntilattributevalues,theinter-attributesummarieswouldﬁtintomainmemory;foralargernumberofattributevalueswetookthemulti-layeredapproachdescribedinSection5.Inallcases,CACTUSistotimesfasterthanSTIRR.

Inthissection,wediscussanapplicationofCACTUStoacombinationoftwosetsofbibliographicentries.TheresultsfromtheapplicationshowthatCACTUSﬁndsintuitivelymeaningfulclustersfromthedatasetthussupportingourdeﬁnitionofacluster.Theﬁrstsetconsistsofbibliographicentriesforarti-clesrelatedtodatabaseresearch[Wie]andthesecondsetcon-sistsofbibliographicentriesforarticlesrelatedtoThe-oreticalComputerScienceandrelatedareas[Sei].Foreacharticle,weusethefollowingfourattributes:theﬁrstauthor,thesecondauthor,theconferenceorthejournalofpublication,andtheyear.Ifanarticleissingly-authoredthentheauthor’snameisrepeatedinthesecondauthorattributeaswell.Thesizesoftheﬁrstauthor,thesecondauthor,theconference,andtheyearattributedomainsforthedatabase-related,thetheory-related,andthecombinedsetsare,

,andrespec-tively.WecombinedthetwosetstogethertocheckifCAC-TUSisabletoidentifythedifferencesandtheoverlapbe-tweenthetwocommunities.Notethatforthesedomains,

someoftheinter-attributesummariesandtheintra-attributesummariesespeciallythoseinvolvingtheﬁrstauthorandthesecondauthordimensionsdonotﬁtinmainmemory.How-ever,wechoosethisparticulardatasetbecauseitiseasiertoverifythevalidityoftheresultingclusters(thanforsomeother

ACMSIGMODManagement,VLDB,ACMTODS,ICDE,ACMSIGMODRecord

ACMTG,COMPGEOM,FOCS,GEOMETRY,ICALP,IPL,JCSS,JSCOMP,LIBTR,SICOMP,TCS,TRPODS,ALGORITHMICA,FOCS,ICALP,INFCTRL,IPL,JCSS,SCT,SICOMP,STOC

Table2:Cluster-projectionsonConferencew.r.t.theFirstAuthor

publiclyavailabledatasets,e.g.,theMUSHROOMdatasetfromtheUCIMachineLearningrepository).

Table5.1showssomeofthe-clustersontheﬁrstau-thorandthesecondauthorattributepair.Weonlypresentthedatabase-relatedcluster-projectionstoillustratethatCAC-TUSidentiﬁesthedifferencesbetweenthetwocommunities.Weveriﬁedthevalidityofeachcluster-projectionbyqueryingontheDatabaseSystemsandLogicProgrammingbibliogra-phyatthewebsitemaintainedbyMichaelLey[Ley].Sim-ilarcluster-projectionsidentifyinggroupsoftheory-relatedresearchersaswellasgroupsthatcontributetobothﬁeldsalsoexist.Duetospaceconstraints,weshowsomecluster-projectionscorrespondingtothelattertwotypesinthefullpaper[GGR99].

Table2showssomeofthecluster-projectionsonthecon-ferenceattributecomputedwithrespecttotheﬁrstauthorattribute.Theﬁrstrowconsistsexclusivelyofagroupofdatabase-relatedconferences,thesecondconsistsexclusivelyoftheory-relatedconferences,andthethirdamixtureofbothreﬂectingaconsiderableoverlapbetweenthetwocommuni-ties.

7ConclusionsandFutureWork

Inthispaper,weformalizedthedeﬁnitionofaclusterwhenthedataconsistsofcategoricalattributes,andthenintroducedafastsummarization-basedalgorithmCACTUSfordiscover-ingsuchclustersincategoricaldata.Wethenevaluatedouralgorithmagainstbothsyntheticandrealdatasets.

Infuture,weintendtoextendCACTUSinthefollowingthreedirections.First,weintendtorelaxtheclusterdeﬁnitionbyallowingsetsofattributevaluesoneachattributewhichare“almost”stronglyconnectedtoeachother.Second,motivatedbytheobservationthatinter-attributesummariescanbein-crementallymaintainedunderadditionanddeletionoftuples,weintendtoderiveanincrementalclusteringalgorithmfromCACTUS.Third,weintendto“rank”theclustersbasedonameasureofinterestingness,say,somefunctionofthesupportofacluster.

Acknowledgements:WethankPrabhakarRaghavanforsendingusthebibliographicdata.

References

[AGGR98]RakeshAgrawal,JohannesGehrke,DimitriosGunopu-los,andPrabhakarRaghavan.Automaticsubspaceclus-teringofhighdimensionaldatafordatamining.InPro-ceedingsoftheACMSIGMODConferenceonManage-mentofData,1998.

[AMS96]RakeshAgrawal,HeikkiMannila,Ramakrishnan

Srikant,HannuToivonen,andA.InkeriVerkamo.FastDiscoveryofAssociationRules.InUsamaM.Fayyad,GregoryPiatetsky-Shapiro,PadhraicSmyth,andRa-masamyUthurusamy,editors,AdvancesinKnowledgeDiscoveryandDataMining,chapter12,pages307–328.AAAI/MITPress,1996.

[BD76]

PeterJ.BickelandKjellA.Doksum.MathematicalStatistics:BasicIdeasandSelectedTopics.PrenticeHall,1976.

[CBM98]E.KeoghC.BlakeandC.J.Merz.UCIrepositoryofmachinelearningdatabases,1998.

[Cou95]TransactionProcessingPerformanceCouncil,May1995.http://www.tpc.org.

[CS96]

P.CheesemanandJ.Stutz.Bayesianclassiﬁcation(au-toclass):Theoryandresults.InU.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors,Ad-vancesinKnowledgeDiscoveryandDataMining,pages153–180.MITPress,1996.

[DER86]I.S.Duff,A.M.Erisman,andJ.K.Reid.DirectMethodsforSparseMatrices.OxfordUniversityPress,1986.[DLR77]

A.P.Dempster,N.M.Laird,andD.B.Rubin.Maximumlikelihoodfromincompletedataviatheemalgorithm.JournaloftheRoyalStatisticalSociety,1977.

[GGR99]

VenkateshGanti,JohannesGehrke,andRaghuRamakr-ishnan.Cactus–clusteringcategoricaldatausingsum-maries.http://www.cs.wisc.edu/vganti/demon/cactus-full.ps,March1999.

[GJ79]

M.R.GareyandD.S.Johnson.Computersandintractability—AguidetothetheoryofNP-completeness.Freeman;BellLab,MurrayHillNJ,1979.

[GKR98]

DavidGibson,JonKleinberg,andPrabhakarRaghavan.Clusteringcategoricaldata:Anapproachbasedondy-namicalsystems.InProceedingsofthe24thInterna-tionalConferenceonVeryLargeDatabases,pages311–323,NewYorkCity,NewYork,August24-271998.[GRS99]

SudiptoGuha,RajeevRastogi,andKyuseokShim.Rock:Arobustclusteringalgorithmforcategoricalattributes.InProceedingsoftheIEEEInternationalConferenceonDataEngineering,Sydney,March1999.[Ley]MichaelLey.Computersciencebibliography.http://www.informatik.uni-trier.de/ley/db/index.html.[Ram97]RaghuRamakrishnan.DatabaseManagementSystems.McGrawHill,1997.

[Sei]J.Seiferas.Bibliographyontheoryofcomputerscience.http://liinwww.ira.uka.de/bibliography/Theory/Seiferas.[Wie]

G.Wiederhold.Bibliographyondatabasesystems.

http://liinwww.ira.uka.de/bibliography/Database/Wiederhold.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文