BIOINFORMATICS ORIGINAL PAPER Sequence analysis

Vol. 23 no. 1 2007, pages 5–13
doi:10.1093/bioinformatics/btl549

Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays

Waibhav Tembe, Nela Zavaljevski, Elizabeth Bode1, Catherine Chase1, Jeanne Geyer1, Leonard Wasieloski1, Gary Benson2 and Jaques Reifman*

Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Ft. Detrick, MD, 1Diagnostic Systems Division, US Army Medical Research Institute of Infectious Diseases, Ft. Detrick, MD and 2Departments of Biology and Computer Science, Boston University, Boston, MA, USA

Received on June 15, 2006; revised on October 18, 2006; accepted on October 21, 2006
Advance Access publication October 26, 2006
Associate Editor: John Quackenbush

ABSTRACT

Motivation: Advances in DNA microarray technology and computational methods have unlocked new opportunities to identify ‘DNA fingerprints’, i.e. oligonucleotide sequences that uniquely identify a specific genome. We present an integrated approach for the computational identification of DNA fingerprints for design of microarray-based pathogen diagnostic assays. We provide a quantifiable definition of a DNA fingerprint stated both from a computational as well as an experimental point of view, and the analytical proof that all in silico fingerprints satisfying the stated definition are found using our approach.

Results: The presented computational approach is implemented in an integrated high-performance computing (HPC) software tool for oligonucleotide fingerprint identification termed TOFI. We employed TOFI to identify in silico DNA fingerprints for several bacteria and plasmid sequences, which were then experimentally evaluated as potential probes for microarray-based diagnostic assays. Results and analysis of approximately 150 in silico DNA fingerprints for Yersinia pestis and 250 fingerprints for Francisella tularensis are presented.

Availability: The implemented algorithm is available upon request.

Contact: jaques.reifman@us.army.mil

INTRODUCTION

The recent advances in genomic sequencing and the availability of large-scale sequence databases have unlocked several opportunities to identify ‘genomic signatures’ or ‘DNA fingerprints’, i.e. short DNA sequences that uniquely ascertain the presence or absence of causative biological agents, such as viruses, bacteria or virulent genes. For example, a vast number of DNA-based detection and diagnostic technologies are being developed to quickly identify biological threat agents (Ivnitski et al., 2003; Slezak et al., 2003; Draghici et al., 2005; Kaderali and Schliep, 2002), such as the anthrax-causing bacterium, Bacillus anthracis, and the plague-causing bacterium, Yersinia pestis. DNA signatures could also be used to detect the presence of one or more virulent genes, such as Bacillus genes, which encode important virulence factors, enterotoxins and exotoxins (Sergeev et al., 2006), and to provide high-resolution differentiation between closely related microorganisms in microbial forensics (Willse et al., 2004). New viruses and strains have been identified using a special microarray technology consisting of approximately 11 000 70mer oligonucleotides (Wang et al., 2002). DNA fingerprints have also been used to develop diagnostic assays for a wide range of important applications in medicine, environmental monitoring and quality control of food products (Hardiman, 2003; Joos and Fortina, 2005; Wang et al., 2002; Abbe et al., 2004).

*To whom correspondence should be addressed.

The specific algorithm implemented in a DNA fingerprint identification method is selected based on (1) whether the DNA fingerprints are being sought for a specific pathogen strain (e.g. Y. pestis CO92), a group of pathogens from the same species (e.g. all Y. pestis strains) or genus (e.g. all Yersinia species), or a set of organisms that may or may not have any phylogenetic relationship (e.g. to detect a viral or a bacterial family) and (2) the experimental conditions specified by the end application technology, such as PCR (Slezak et al., 2003; Viljoen et al., 2005; Haas et al., 2003; Gordon and Sensen, 2004) or DNA microarrays (Kaderali and Schliep, 2002; Hardiman, 2003; Rahmann, 2003; Leber et al., 2005; Nordberg, 2005). The use of real-time PCR-based detection technology requires the identification of three informative sequences: two amplification primer sequences and an additional probe sequence (the fingerprint). The assay requires that primer hybridization takes place near the fingerprint and, therefore, imposes constraints on the position of the primer and PCR-based fingerprints. Moreover, PCR-based assays are quite limited in their multiplexing capabilities, as different assays are required to detect different pathogenic sequences. In contrast, microarrays do not impose any position-specific constraints on the DNA fingerprints, and several fingerprints can be simultaneously placed on a microarray to provide detection redundancy and allow for the diagnosis of multiple pathogens on a single assay. Despite these advantages, microarray-based assays are relatively insensitive and slow compared to the exquisite sensitivity and speed of PCR-based assays. Microarray sensitivity can be greatly enhanced by incorporating sample amplification prior to hybridization but, unfortunately, this results in a net increase in assay time for already slow assays.

This paper is concerned with the identification of DNA fingerprints for specific, single pathogenic sequences, referred to as the target, for the design of DNA microarray-based diagnostic assays. The target could be an entire genome (e.g. B. anthracis Ames), a chromosome (e.g. Brucella melitensis biovar abortus 2308 chromosome II) or a non-chromosome sequence (e.g. B. anthracis plasmid pXO2). A more general problem, not addressed here, involves the identification of DNA fingerprints common to multiple strains or multiple species. An effective approach for this problem is to use multiple genome alignment and search for conserved regions to identify common DNA signatures (Slezak et al., 2003). Identification of common DNA signatures becomes an even greater problem for highly variable RNA viruses (Gardner et al., 2004), where a promising solution is to select combinations of non-unique probes and use unique hybridization patterns to unambiguously identify specific viral strains (Urisman et al., 2005; Schliep et al., 2003).

Given the long length of most targets, the identification of DNA fingerprints is a problem of high computational complexity. The potential solution space is extremely large because every subsequence of the target sequence needs to be considered. Furthermore, the determination of uniqueness requires comparison with nucleotide databases, such as GenBank (Benson et al., 2005), that are growing exponentially in size. Moreover, uniqueness of DNA fingerprints obtained using such comparative algorithms is only valid with respect to the reference database used. As new sequences are made available, previously identified fingerprints need to be revalidated.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Several practical challenges arise in the experimental evaluation of the computationally identified DNA fingerprints for microarray-based assays. The DNA fingerprints should produce a high response when hybridized with a sample containing the target genome. Conversely, the response for any non-target genome should be as low as possible. Thus, algorithms for DNA fingerprint identification must include experimental constraints and DNA–DNA hybridization modeling methods to predict the response on a microarray. Although research in modeling molecular level interactions between DNA sequences has made significant progress, there is, unfortunately, no analytical method available today that can predict the exact outcome of a hybridization reaction between two or more arbitrary DNA sequences (SantaLucia and Hicks, 2004; Nordberg, 2005). Moreover, due to the variability in the outcome of a hybridization experiment, a large number of repetitions are required to experimentally evaluate the DNA fingerprints. It might not be possible to experimentally test all the computationally identified fingerprints because of the associated costs and limited resources. To simultaneously accommodate these computational needs and experimental constraints, DNA fingerprint identification tools (Kaderali and Schliep, 2002; Rahmann, 2003; Leber et al., 2005) integrate computational algorithms for identifying unique sequences and DNA hybridization modeling tools for predicting the outcome of the microarray experiment. Often, these tools apply various approximations to reduce the computational complexity. For example, the integrated approach of Kaderali and Schliep (2002) uses an efficient search algorithm based on suffix trees and a simplified two-state transition near-neighbor thermodynamic model for DNA probe design and cross-hybridization. The simplified model reduces the computational time but introduces modeling errors in the DNA fingerprint design. A similar thermodynamic model with a computationally more efficient approach was proposed by Rahmann (2003). An efficient fractional programming approach for melting-temperature computation with an improved two-state transition near-neighbor thermodynamic model has also been proposed (Leber et al., 2005); however, computational time would still be an issue for cross-hybridization evaluation of a large number of non-target genomes.

Computational and experimental factors make quantification of the uniqueness or specificity of a short DNA sequence challenging. In fact, a literature survey indicates that the related studies have not stated a precise, quantitative definition of a DNA fingerprint. Although the general idea is to search the target genome for ‘unique’ DNA sequences and then test them experimentally, the in silico criterion for uniqueness has not been explicitly stated. In this paper, we first provide a formal definition of a DNA fingerprint based on various experimental conditions and a specificity criterion. We then describe an integrated approach that combines efficient bioinformatics algorithms, takes into account experimental constraints, and includes a large-scale comparison of DNA fingerprints with nucleotide databases. Next, we describe the algorithm underlying TOFI (tool for oligonucleotide fingerprint identification), its software implementation on a high-performance computing (HPC) platform, and an analytical approach to choose the input parameters of TOFI, which guarantees that all possible DNA fingerprints satisfying the stated definition are obtained. Finally, we discuss initial experimental results, which help evaluate our definition of a DNA fingerprint and the associated specificity criterion.

TERMINOLOGY AND PROBLEM DEFINITION

A DNA fingerprint for a given target genome g_t is defined with respect to a reference nucleotide sequence database denoted by G = {g_1, g_2, ..., g_n, ..., g_N} that contains N sequences. In practice, G consists of DNA sequences from one or more publicly available comprehensive databases, such as GenBank, or any other smaller nucleotide database, such as a viral DNA sequence database. The target genome may or may not belong to G, implying that it could be a known pathogen or a newly sequenced one. Implicit in the definition of a DNA fingerprint is its validity with respect to the available database. As newer DNA sequences become available and are added to G, it is necessary to verify the validity of the previously identified fingerprints.

The definition of a DNA fingerprint varies with the school of thought, computational or biological. Therefore, some discussion about our notion of a DNA fingerprint is in order.

From a pure computer science standpoint, a DNA fingerprint of g_t could be defined as ‘any subsequence of g_t that is not a subsequence of any g_n ∈ G, n ≠ t’. By this definition, the problem of identifying DNA fingerprints is equivalent to the classic string comparison problem of identifying substrings of g_t that do not exactly match any substring of any g_n ∈ G, n ≠ t. Although mathematically correct, this definition lacks the application-specific requirements. DNA fingerprints have to satisfy: (1) design constraints, so that they can be used as DNA probes on microarrays and (2) specificity constraints, so that they can discriminate, in a microarray hybridization reaction, between target and non-target sequences. DNA fingerprints that simultaneously satisfy both design and specificity constraints require a biologically more sound definition. We mathematically formalize the experimental and specificity constraints as follows.

Let K denote the DNA microarray experimental constraints, such as the minimum and maximum length of the DNA fingerprint, the hybridization melting-temperature, GC content, etc. (SantaLucia and Hicks, 2004), and let P = {p_1, p_2, ..., p_i, ..., p_I} denote the set of all subsequences of g_t that satisfy K. Thus, by definition, every p_i ∈ P will have length within the specified minimum (L_min) and maximum (L_max) bounds, GC content within the required range, and will satisfy several other properties specified by the chosen DNA hybridization modeling methodology. We refer to the sequences in P as DNA probes. Note that constraints denoted by K do not specify attributes regarding the uniqueness of a DNA probe with respect to non-target sequences.
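To make the role of K concrete, the following Python sketch enumerates the subsequences of a sequence that satisfy two of the constraints only, length and GC content. It is an illustration under simplifying assumptions and not the thermodynamic model used by TOFI: the function names `gc_percent` and `design_probes` are hypothetical, the default bounds are the typical values listed in Table 1, and melting temperature and the remaining constraints are deliberately left out.

```python
def gc_percent(seq):
    """Percentage of G and C bases in `seq` (upper-case DNA string)."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def design_probes(sequence, l_min=35, l_max=40, gc_min=45.0, gc_max=50.0):
    """Return (start, probe) pairs for every subsequence whose length lies in
    [l_min, l_max] and whose GC content lies in [gc_min, gc_max].

    Only the length and GC parts of K are modeled here (an assumption made
    for illustration); melting temperature is not checked.
    """
    probes = []
    for length in range(l_min, l_max + 1):
        for start in range(len(sequence) - length + 1):
            probe = sequence[start:start + length]
            if gc_min <= gc_percent(probe) <= gc_max:
                probes.append((start, probe))
    return probes
```

The set returned by such a filter plays the role of P only loosely, since a real probe design tool would reject further probes on thermodynamic grounds.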

Quantifying specificity of DNA fingerprints from an experimental point of view is very subjective. It is based on interpreting the experimental hybridization results between DNA probes and non-target DNA sequences. Since there is a lack of accurate in silico hybridization models, we infer the specificity of a DNA probe by first computing DNA sequence alignments, and then determining if the aligned probe meets an empirical threshold T. This is based on the hypothesis that DNA sequences that align poorly are unlikely to form a stable DNA–DNA duplex for a given set of experimental constraints. This hypothesis implies that the DNA sequence alignment, calculated strictly using computational tools, provides quantification, through the threshold T, of the actual strength of the DNA–DNA duplex. Thus, we compute the specificity of a DNA probe from the number of mismatches, gaps and insertions/deletions in the alignment and compare it with the threshold T, representing the set U of all specificity constraints. Having formalized the experimental and specificity constraints, we define a DNA fingerprint and the problem of identifying all DNA fingerprints for a target genome as follows:

Definition (DNA fingerprint): A DNA probe p_i of length L_i is considered a DNA fingerprint of g_t if and only if an optimal sequence alignment between p_i and any other sequence g_n ∈ G, n ≠ t, has at most L_i − T matches.

Definition (DNA fingerprint identification): For a target DNA sequence g_t, find all in silico DNA fingerprints that satisfy the experimental constraints K and specificity constraints U with respect to a reference DNA sequence database G.

Let S = {s_1, s_2, ..., s_f, ..., s_F} be a subset of P, i.e. S ⊆ P, that denotes the set of DNA probes that satisfy both constraints K and U. Our goal is to find all F elements of S. We refer to the elements of S as in silico DNA fingerprints because they satisfy all constraints that have been quantified for computational purposes. Their experimental validity needs to be tested in an actual DNA microarray experiment. Unless stated otherwise, henceforth the term ‘DNA fingerprint’ implies in silico DNA fingerprint, which is valid with respect to a reference database used.
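The fingerprint definition can be read operationally. The Python sketch below is one possible reading, not the procedure implemented in TOFI: it takes the ‘optimal alignment’ to be the one that minimizes the edit distance between the probe and any substring of a non-target sequence, and it treats a probe as a fingerprint when that distance is at least T for every non-target sequence, in line with the description of T in Table 2 as a combined number of mismatches, insertions and deletions. The function names are hypothetical, only one strand is examined, and the quadratic-time dynamic program is practical only for short sequences.

```python
def min_edit_distance_to_substring(probe, sequence):
    """Smallest edit distance between `probe` and any substring of `sequence`.

    Semi-global dynamic programming: the probe may start and end anywhere
    in `sequence` at no cost.
    """
    prev = [0] * (len(sequence) + 1)  # an empty probe matches anywhere for free
    for i, p in enumerate(probe, start=1):
        curr = [i] + [0] * len(sequence)
        for j, s in enumerate(sequence, start=1):
            curr[j] = min(prev[j - 1] + (p != s),  # match or substitution
                          prev[j] + 1,             # probe base left unaligned
                          curr[j - 1] + 1)         # sequence base left unaligned
        prev = curr
    return min(prev)


def is_fingerprint(probe, non_targets, t):
    """True if the probe differs from every non-target sequence by at least
    `t` edits, i.e. aligns with at most roughly len(probe) - t matches."""
    return all(min_edit_distance_to_substring(probe, g) >= t
               for g in non_targets)
```

Running this exact computation against a large reference database would be far too slow, which is why a heuristic aligner is the more practical choice at scale.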

INTEGRATED APPROACH

TOFI implements a multi-step approach that breaks down the problem of fingerprint identification into the three steps illustrated in Figure 1. The first step reduces the solution space by discarding DNA sequences common to both the target sequence and one or more biological near-neighbor sequences. The surviving sequences are termed candidate sequences. In the second step, a microarray DNA probe design phase extracts from candidate sequences only those subsequences that satisfy the application-specific experimental constraints K. In the third step, each DNA probe is aligned with all DNA sequences present in the chosen reference nucleotide database. A user-defined specificity constraint U is used to interpret the alignments from the standpoint of cross-hybridization between the DNA probes and non-target genomes. DNA probes satisfying U are reported as DNA fingerprints and are tested on microarrays.

Fig. 1. The Three Steps of the TOFI Algorithm.

Although the problem has been split into three discrete steps for clarity of explanation, the individual steps are not completely independent from an algorithmic standpoint. In fact, our objective, to obtain all DNA fingerprints, leads us to conclude that the input parameters in the first step have an analytical relationship with the constraints imposed in the second and the third steps. We discuss the interdependence of the three steps after an in-depth description of each of the three steps.

Step 1: solution space reduction

The solution space to be searched is extremely large because every subsequence of the given target must be considered. Testing each subsequence experimentally is impractical and expensive, whereas reducing the solution space computationally is quicker and cheaper. For this purpose, we exploit the sequence similarities between the target genome and an evolutionary near-neighbor (g_r) that can be identified from a phylogenetic tree or published data. The target and neighbor will contain common DNA sequences, which, obviously, cannot be used as DNA fingerprints. DNA sequences common to both g_t and g_r are extracted using suffix trees (Wiener, 1973; Gusfield, 1997), which, within the domain of comparative genomics, have been used in Vmatch (Kurtz, 2002) and the Maximal Unique Matcher (MUMmer) (Kurtz et al., 2004) to identify repeats, exact or approximate matches, and single nucleotide polymorphisms. Details of the construction, traversal and numerous applications of suffix trees in several different string-matching applications are available in (Gusfield, 1997). It should be noted that the solution space could be further reduced by comparing the target genome with multiple non-target genomes, as described by Slezak and colleagues (Slezak et al., 2003). This could be done within TOFI’s current algorithmic configuration by concatenating strings from multiple non-target genomes into one long string and providing it as the near-neighbor genome for comparison. An alternate, perhaps more efficient approach, which would require software modification, is to compare the target sequentially against a list of non-target sequences, so that after each comparison only unmatched sequences are compared with the subsequent non-target genomes from the list.

Table 1. Typical values for DNA probe design constraints K

Constraint                  Minimum      Maximum
Length (bases)              L_min: 35    L_max: 40
Melting-temperature (°C)    T_min: 70    T_max: 75
GC content (%)              45           50

Fig. 2. Output of the Suffix-Tree-Based Algorithm.

Once the exact matches between the target g_t and its near-neighbor g_r are identified, the target can be represented as a concatenation g_t = C_0 M_1 C_1 M_2 C_2 ... M_J C_J, as shown in Figure 2. M_j denotes the jth exact match of length |M_j| and C_j denotes the jth candidate sequence, i.e. a sequence in the target that contains no matches of length M or longer with the near-neighbor. C_0 and/or C_J can be null based on whether or not there is an exact match at the beginning or the end of g_t, respectively. Exact matches of length M or longer are discarded and only the candidate sequences are retained for further consideration. The candidate sequences have no restriction with respect to their length, position in the genome, or composition of base pairs. Of particular interest to our application is the choice of input parameters to the suffix-tree-based algorithm, in particular, the minimum length M of exact matches between g_t and g_r that would lead to identification of all DNA fingerprints. Our analysis indicates that the parameter M is closely related to the experimental and specificity constraints, detailed in Steps 2 and 3, respectively. Therefore, we first describe the remaining two steps of TOFI before an analytical relation is derived between the parameter M and the constraints imposed by the problem definition.

This selection of M differs from a related study (Slezak et al., 2003), where M = 18 was heuristically selected to meet the minimal PCR primer size.
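The decomposition of the target into exact matches and candidate sequences can be sketched in a few lines of Python. This is a simplified stand-in rather than the suffix-tree algorithm itself: it assumes that a hash set of all length-M substrings of the near-neighbor is an acceptable substitute for small inputs, it examines a single strand only, and the function name `candidate_intervals` is hypothetical. Every target position covered by a shared substring of length M or longer is marked, and the unmarked runs are returned as candidate intervals.

```python
def candidate_intervals(target, neighbor, m):
    """Half-open (start, end) intervals of `target` that are not covered by
    any exact match of length >= m with `neighbor`."""
    # All length-m substrings of the near-neighbor (stand-in for a suffix tree).
    neighbor_mers = {neighbor[i:i + m] for i in range(len(neighbor) - m + 1)}

    # Mark every target position that lies inside a shared length-m window.
    covered = [False] * len(target)
    for i in range(len(target) - m + 1):
        if target[i:i + m] in neighbor_mers:
            for k in range(i, i + m):
                covered[k] = True

    # Collect the maximal uncovered runs: these play the role of C_0 ... C_J.
    intervals, start = [], None
    for pos, flag in enumerate(covered):
        if not flag and start is None:
            start = pos
        elif flag and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(target)))
    return intervals
```

For genome-scale inputs, a suffix-tree or suffix-array tool would be the appropriate replacement for the hash set used here.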

Step 2: microarray probe design

The second step imposes a set of experimental constraints K to extract DNA microarray probes from the candidate sequences. A recent review (Panjkovich and Melo, 2005) indicates that for the same input DNA sequences different in silico probe design modeling tools, not surprisingly, produce different sets of DNA probes. To our knowledge, there is no universally accepted modeling methodology available today to design microarray probes from DNA sequences. Often, these tools are used in an iterative, trial-and-error fashion to optimize the quality/number of output DNA probes to suit the application-specific needs.

We have selected a probe design tool that implements a multi-state thermodynamic model for melting-temperature (SantaLucia and Hicks, 2004). The model allows for the representation of several dozen constraints on the DNA probes, such as probe length, GC content, molar concentrations, self-hybridization possibilities and limit on the number of single nucleotide repeats. Additional information on the constraints K can be found in SantaLucia and Hicks (2004). As an example, only a few important constraints are shown in Table 1. The DNA probes satisfying these constraints are extracted from every candidate sequence and are passed on to the next step.

Step 3: specificity determination by sequence alignment

In the third step, every DNA probe is aligned with sequences in the reference nucleotide database. The results of the alignments are interpreted to predict cross-hybridization using the following general rule: A DNA probe that aligns poorly with all non-target DNA sequences is unlikely to cross-hybridize with non-targets and, therefore, should be considered as a fingerprint.

Due to the limitations of DNA–DNA hybridization models, determining the alignment corresponding to the optimal DNA–DNA duplex on a microarray is hard. Computationally, optimal alignment between two DNA sequences could be defined using the generalized edit distance algorithm (Gusfield, 1997). Simply put, the edit distance between two sequences corresponds to the total number of insertions, deletions and substitutions that are needed to transform one sequence into the other. From the standpoint of DNA cross-hybridization, a substitution corresponds to a mismatched pair of nucleotides and insertions/deletions correspond to gaps in the DNA–DNA duplex. The lower the number of mismatches and gaps in the alignment, the lower the edit distance. However, edit distance does not provide sufficient information with regard to the strength of hybridization. For example, it does not consider the position of matches in the alignment, GC content, gap length and the longest common factor in the alignment (Rahmann, 2003). Moreover, computing the optimal alignment between each DNA probe and every DNA sequence in a large database, such as the ‘nt’ nucleotide database from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/) (Pruitt et al., 2005), would be very computationally intensive.

Based on these issues and practical time constraints, we opted to use the BLASTN program from BLAST (Altschul et al., 1990) for aligning DNA probes with a reference database. The alignment algorithm in BLAST is a heuristics-based approach that starts off by identifying a word of exact match of a given length (w parameter) and proceeds by extending it using dynamic programming to allow mismatches and gaps in the alignment. A statistical significance score termed E-value is used to distinguish between potentially meaningful alignments and chance alignments. The E-value score was used in (Draghici et al., 2005) to quantify the specificity of DNA probes. However, E-values are determined by the length of the alignment, size of the query, size of the total database and several other parameters that are not related to the ability of a probe to form a cross-hybrid with a non-target genome.

Instead of using E-value alone to determine probe specificity, TOFI examines the actual alignments reported by BLAST and determines the specificity of a probe by taking into account the number of matches, mismatches and gaps in the alignment independent of its statistical significance. These specificity constraints U form the basis for the empirical threshold T, used in the following hypothesis to infer hybridization patterns from ‘optimal’ BLAST alignments: If the best BLAST alignment between a DNA probe and a non-target genome has more than T mismatches or gaps, then the DNA probe will be considered as an in silico DNA fingerprint.

It is well documented that using BLAST for assessment of cross-hybridization of a probe with non-target genomes will result in some non-specific probes (Rahmann, 2003; Nordberg, 2005). If a word of length w is not found in a database sequence, the probe alignment with the sequence will be skipped, resulting in potential missed cross-hybridization. In other situations, partial alignments with probes may result in underestimated cross-hybridization. Two promising approaches could be considered to improve probe specificity. First, additional filtering of the probes selected as fingerprints by BLAST could be performed to augment the hypothesis relating alignment to hybridization. In this case, additional information would be extracted from analyses of the alignments reported by BLAST, such as the maximum number of contiguous matches or the position of matches in the probe alignment, and used as rules to improve specificity constraints. Second, better alignment algorithms could be implemented as a post-processing step, which would incorporate hybridization thermodynamics into the alignment evaluation to take into account hybridization stability (Leber et al., 2005). However, it must be emphasized that the lack of an accurate model to directly relate a DNA sequence alignment with its corresponding DNA–DNA hybridization leaves the choice of probe specificity characterization as an open question.

Table 2. Algorithm for candidate sequence selection using the suffix tree output

Variables:
g_t = Target genome
g_r = Near-neighbor genome of g_t
L_min = Minimum probe length
L_max = Maximum probe length
T = The specificity constraint, i.e. the combined minimum number of mismatches, insertions and deletions in the optimal alignment between a probe and any non-target sequence, defining a fingerprint
M = Input parameter of the suffix-tree-based algorithm that specifies the minimum length of exact matches between g_t and g_r
M_j = jth exact match between g_t and g_r, where j = 1, ..., J
C_j = jth candidate sequence, j = 0, 1, ..., J, defined as a subsequence of g_t that is: bounded on both sides by exact matches of length at least M, or located between the start of g_t and the first exact match of length at least M, or located between the last exact match of length at least M and the end of the genome g_t
E = Length of the extension of candidate sequences into the adjacent exact match(es)

Procedure:
Input: L_max, L_min, T, g_t and g_r
(1) P_cand = empty set
(2) Let E = L_max − T
(3) Let M = E + 1 = L_max − T + 1
(4) Using suffix tree, identify exact matches M_1, ..., M_J of length at least M between g_t and g_r, and candidates C_0, C_1, ..., C_J from g_t
(5) For each candidate sequence C_j from the suffix tree output,
    (A) If C_j is a prefix of g_t, then extend C_j by E bases to the right. Go to step D
    (B) If C_j is a suffix of g_t, then extend C_j by E bases to the left. Go to step D
    (C) Extend C_j on the left and the right by E bases
    (D) Add all subsequences of C_j satisfying the length constraints to P_cand
Result: Every candidate DNA probe of g_t satisfying constraints L_max, L_min, and T will be in P_cand

TOFI parameter selection

The three steps in TOFI implement different bioinformatics algorithms, each carrying out a different task using its own set of input parameters. However, the minimum length of the exact matches M in the first step is analytically related to the length constraints L_min and L_max on the DNA probes in the second step and the specificity threshold T used in the third step. In this section, we mathematically derive this analytical relationship.

The problem of selecting an appropriate M value could be stated as follows: given the length constraints L_min and L_max on the length of the probes and the specificity threshold T, find a relationship between L_min, L_max, T and M, which guarantees that no valid DNA fingerprints are discarded.

Our approach is initiated by extending each candidate sequence C_j, j = 0, 1, ..., J, by E bases into each side of the neighboring exact match. This prevents the possible discarding of signatures that include the boundaries of C_j. From the extended candidate sequences we construct a candidate DNA probe set P_cand, which contains every sequence satisfying the length constraints L_min and L_max. Only those DNA probes in P_cand that satisfy the experimental constraints will be included in the probe set P for alignment with the reference dataset, i.e. P ⊆ P_cand.

We choose E such that M > E. This condition guarantees that the overlaps between two adjacent extended candidates, if any, will be limited to the exact match region separating the two candidates. It also sets a lower limit on M, M = E + 1. To ensure that any candidate DNA probe of length L_i having less than or equal to L_i − T exact matches is not discarded, the extension length should be E = L_i − T. Thus, the extension length E is constrained by L_min − T ≤ E ≤ L_max − T. Substituting for M = E + 1 and making a conservative selection, we obtain M = L_max − T + 1. Such selection will, most likely, generate a candidate probe set P_cand that contains some false positives, i.e. probes that do not satisfy the specificity constraints. The majority of such non-specific probes will be discarded after the BLAST alignment inspection. However, some false positives will remain due to possible missed matches in BLAST, as described in the previous section. The details of the candidate probe selection algorithm are given in Table 2.
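Steps 2, 3 and 5 of Table 2 can be sketched in Python as follows. The sketch assumes that candidate sequences are supplied as half-open (start, end) intervals on the target, which is a representation chosen here for illustration, and the function names are hypothetical. Clamping the extended interval to the ends of the target covers the prefix and suffix cases (steps 5.A and 5.B) without separate branches.

```python
def match_length_parameters(l_max, t):
    """Return (E, M) with E = L_max - T and M = E + 1 (Table 2, steps 2-3)."""
    e = l_max - t
    return e, e + 1


def extend_candidates(intervals, target_len, e):
    """Extend each half-open (start, end) candidate interval by `e` bases on
    both sides, clamped to the target (Table 2, steps 5.A-5.C)."""
    return [(max(0, start - e), min(target_len, end + e))
            for start, end in intervals]


def candidate_probes(target, intervals, l_min, l_max, t):
    """Build P_cand: every subsequence of an extended candidate whose length
    lies in [l_min, l_max] (Table 2, step 5.D)."""
    e, _ = match_length_parameters(l_max, t)
    probes = set()
    for start, end in extend_candidates(intervals, len(target), e):
        for length in range(l_min, l_max + 1):
            for i in range(start, end - length + 1):
                probes.add((i, target[i:i + length]))
    return probes
```

Storing probes in a set keyed by start position avoids duplicates when two extended candidates overlap within the exact match that separates them.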

Finally, we prove that if the candidate sequences are obtained using E = L_max − T, with M = E + 1, then all DNA fingerprints are included in the set P_cand. Since S denotes the set of DNA fingerprints for g_t, it will suffice to prove that such selection for M guarantees that S ⊆ P_cand.

Assertion 1. By construction, every subsequence of g_t containing an exact match of length smaller than M is included in an extended candidate sequence (steps 5.A–5.C in Table 2).

Assertion 2. From each extended candidate sequence, every subsequence satisfying the length constraints L_min and L_max is included in P_cand (step 5.D in Table 2).

Assertion 3. From assertions 1 and 2, none of the sequences in P_cand can contain an exact match of length M or greater.

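Assertion 4. By definition, a DNA fingerprint s_i ∈ S contains at most L_i − T exact matches when aligned with any non-target genome. Thus, the length of the longest exact match between s_i and any other non-target genome is at most L_i − T. Since L_i − T ≤ L_max − T = E and M = E + 1, the longest exact match in s_i is shorter than M, so s_i is included in P_cand by assertions 1 and 2, i.e. S ⊆ P_cand.

RESULTS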

Software implementation

We used MUMmer (Kurtz et al., 2004), an open source software that implements a suffix-tree-based algorithm and provides several options for comparing genomic sequences. The microarray DNA probe design from candidate sequences was carried out using the commercial software oligonucleotide modeling platform (OMP) (available at http://www.dnasoftware.com), which implements a state-of-the-art hybridization model (SantaLucia and Hicks, 2004). The BLASTN program from NCBI_BLAST (version 2.2.10) was used for aligning the probes against the more than 2.0 million nucleotide sequences stored in the ‘nt’ nucleotide database at the NCBI. The database has grown significantly since we obtained the results described in this paper and we have downloaded the latest version, containing more than 3.6 million sequences, for future runs of TOFI.

The entire software pipeline was initially implemented on a HPC environment at the Advanced Biomedical Computing Center (http://www-fbsc.ncifcrf.gov/) using the High Throughput Computing support from SGI® on an Altix cluster consisting of 64 × 1.5 GHz Itanium 2 processors running Red Hat® Linux with 64 GB of shared memory. Once candidate sequences are obtained, TOFI takes advantage of the parallel programming opportunities on HPC resources. The DNA probe design using OMP has been parallelized using OpenMP by scheduling DNA probe design for each candidate sequence on a separate processor. The execution of BLASTN, by far the most computationally intensive part of TOFI, is parallelized by assigning batches of DNA probes to separate processors. In addition, several application-specific software modules to process DNA sequences, to compile results of the intermediate stages for analysis, and to process outputs of various stages were implemented. This choice of resources and software is just one way to implement TOFI’s integrated approach shown in Figure 1. A different choice of software for suffix tree, DNA probe design and sequence alignments could be used as well. However, our particular choice represents, arguably, some of the best tools available for each of the three steps.

TOFI has since been ported onto one of the U.S. Department of Defense Major Shared Resource Center’s Linux clusters, consisting of 128 dual processor nodes on a distributed memory system, where deployment of mpiBLAST (Darling et al., 2003) and execution of OMP on separate processors is being tested. In the current cluster implementation, we use mpiBLAST with 32 processors running in parallel, which, again, consumes the bulk of the computing time.

Fig. 3. Identification of Y. pestis DNA Fingerprints Using TOFI on a 32-CPU Linux Cluster.

The computational time of the algorithm depends on the number of probes generated as the output of Step 2 and provided to mpiBLAST and on the size of the reference database. The number of probes, in turn, depends on the length of the target genome, the availability and similarity of a near-neighbor genome, and the selected probe design constraints. The reference database is segmented according to rules of thumb suggested by the mpiBLAST developers (Darling et al., 2003), where the number of database segments is set to the number of processors. Hence, the computational time of processing the reference database is directly dependent on the speedup achieved by mpiBLAST (Darling et al., 2003). The execution time could be improved by using other parallel versions of BLAST, such as pioBLAST (Lin et al., 2005).

Case study: DNA fingerprints for Y. pestis

TOFI was used to identify DNA fingerprints for the plague-causing pathogen Y. pestis strain CO92 (accession no. NC_003143.1). Based on the literature (Chain et al., 2004), a closely related organism, Yersinia pseudotuberculosis strain IP32953 (accession no. NC_006155.1), was selected as the near-neighbor genome.

The empirical threshold T = 15 was selected based on a priori analysis of hybridization properties for the selected microarray technology and experimental setup, such as the probe length and required melting-temperature. This parameter can be adjusted on a case-by-case basis using feedback from additional experimental evaluations. For T = 15 and maximum probe length L_max = 40, according to step 3 in Table 2, the minimum length of exact matches in MUMmer is M = 26.
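As a worked check of this value, substituting the case-study parameters into steps 2 and 3 of Table 2 gives:

```latex
\begin{align*}
E &= L_{\max} - T = 40 - 15 = 25,\\
M &= E + 1 = L_{\max} - T + 1 = 26.
\end{align*}
```

Exact matches of 26 bases or more with the near-neighbor are therefore removed in the first step, and each candidate sequence is extended by 25 bases into its neighboring exact matches.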

Approximately 96% of the Y. pestis genome was discarded using MUMmer in the first step (Fig. 3). Thus the idea of using a near-neighbor genome to identify and discard exact matches proved to be extremely effective in this case. Out of about 4.6 million bases of Y. pestis, fewer than 200 000 bases, distributed unevenly among 2222 candidate sequences, were considered further. In the next step, slightly over 13 600 DNA probes satisfying the experimental constraints were extracted from the candidate sequences.

In the BLAST probe specificity evaluation, the seed size w = 7 was selected because it was the smallest value available in the BLAST version that we used, and a large E-value = 100 reduced the possibility of missing high scoring alignments.

Based on the specified TOFI parameters, all but 146 DNA probes were rejected, defining the in silico DNA fingerprints that were selected for experimental evaluation using custom DNA microarrays. These DNA fingerprints underwent further screening based on additional experimental constraints, such as the presence of restriction enzyme cleavage sites, leaving only 99 in silico fingerprints for testing.
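The rejection step can be illustrated with a short Python sketch that applies the fingerprint definition to BLAST hits. Several assumptions are made and none of them is taken from the TOFI implementation: hits are assumed to be in the standard 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score); the number of matches is estimated from percent identity and alignment length; and the function names and the `target_ids` argument are hypothetical. A probe passes if no non-target hit has more than L_i − T matches.

```python
def matches_in_hit(line):
    """Parse one tab-separated BLAST hit and return
    (query_id, subject_id, estimated number of matching positions)."""
    fields = line.rstrip("\n").split("\t")
    percent_identity = float(fields[2])
    alignment_length = int(fields[3])
    matches = round(percent_identity * alignment_length / 100.0)
    return fields[0], fields[1], matches


def passes_specificity(probe_id, probe_len, t, hit_lines, target_ids=frozenset()):
    """True if every non-target hit of `probe_id` has at most
    probe_len - t matches. Hits to sequences in `target_ids` are ignored,
    and a probe without any non-target hit passes."""
    for line in hit_lines:
        query, subject, matches = matches_in_hit(line)
        if query != probe_id or subject in target_ids:
            continue
        if matches > probe_len - t:
            return False
    return True
```

Because positions of the probe that fall outside a partial alignment are counted as non-matching, a short but perfect local hit can still pass this test, which is one way in which cross-hybridization may be underestimated.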

Experimental evaluation of in silico fingerprints

Ten customized DNA microarray chips, each containing several replicates of the 99 in silico DNA fingerprints and a number of control sequences, were fabricated and used for evaluation purposes. Six chips were hybridized with the target genome Y. pestis and four chips were used to test cross-hybridization with the near-neighbor genome Y. pseudotuberculosis. Normalized data were used to compare hybridization signals.

The microarray hybridization data were used to analyze the discriminating power of the in silico fingerprints by comparing the experimental hybridization results of the probes with Y. pestis and Y. pseudotuberculosis. Figure 4 illustrates a sample set of data showing the normalized response (y-axis) as a function of the DNA fingerprints, which are arranged in descending order of the difference between their responses with Y. pestis and Y. pseudotuberculosis. Variability in the hybridization responses in repeated experiments is presented by standard error bars for each probe. Out of the 99 DNA fingerprints tested, 20 (data not shown in Fig. 4) produced higher average response for Y. pseudotuberculosis than that for the target. This is due to computational and experimental reasons. The computational reasons relate to limitations of using BLAST for specificity evaluation, as discussed in Section 3, step 3. A detailed post-experimental analysis of the BLAST outputs indicates that 12 out of the 20 probes do not have reported alignments with Y. pseudotuberculosis in the significant hit list. For the remaining eight probes, contiguous matches of 20 bases or more were observed in the BLAST alignments but the calculated sum of mismatches and gaps was larger than 15, causing these probes to be identified as fingerprints. This type of problem could be avoided if a threshold for the maximum number of contiguous matches could be experimentally determined and used for additional filtering of the probes that passed the first BLAST specificity testing. The experimental reasons relate to the variability of probe responses. Although all of the 20 probes have larger mean responses for Y. pseudotuberculosis than that for Y. pestis, only six of these probes have significantly larger responses. The experimental reason for the observed aberrant hybridization of these six probes is not clear. These 20 probes were excluded from further evaluation.
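The additional filter on contiguous matches suggested above could look like the following sketch. It assumes that each alignment is available as a pair of equal-length aligned strings covering the whole probe, with '-' marking gaps; the function names, the use of "at least T" for the mismatch test and the cutoff of 19 contiguous matches are illustrative assumptions only, since the text notes that such a cutoff would have to be determined experimentally.

```python
def longest_match_run(probe_aln: str, subject_aln: str) -> int:
    """Length of the longest run of identical, non-gap aligned columns."""
    best = run = 0
    for p, s in zip(probe_aln, subject_aln):
        if p == s and p != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def non_matching_columns(probe_aln: str, subject_aln: str) -> int:
    """Number of aligned columns that are a mismatch or contain a gap."""
    return sum(
        1 for p, s in zip(probe_aln, subject_aln)
        if p != s or p == "-" or s == "-"
    )


def keep_probe(probe_aln: str, subject_aln: str, t: int = 15, max_run: int = 19) -> bool:
    """Keep a probe only if it has enough differences AND no long exact run."""
    enough_differences = non_matching_columns(probe_aln, subject_aln) >= t
    no_long_run = longest_match_run(probe_aln, subject_aln) <= max_run
    return enough_differences and no_long_run


# Toy 40-base probe: 22 matching bases followed by 18 mismatches.
probe = "A" * 22 + "C" * 18
subject = "A" * 22 + "G" * 18
print(non_matching_columns(probe, subject))  # 18
print(longest_match_run(probe, subject))     # 22
print(keep_probe(probe, subject))            # False
```

The toy probe passes the mismatch count alone but is rejected once the run-length test is added, which mirrors the eight probes described above.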

Fig. 4. Comparison of hybridization of in silico fingerprints with target (Y. pestis) and non-target (Y. pseudotuberculosis).

For a fingerprint to be useful in a diagnostic assay, it should yield a very low response for non-targets and a high response for the target. Thus, a few DNA fingerprints in Figure 4 that have a good discriminatory power but have a relatively high response for non-targets would not be considered useful in diagnostic assays. The data used in Figure 4 can also be used to identify valid fingerprints based on alternate rules, such as identifying quantifiable threshold values for target and non-target responses. For example, 25 probes could be selected by using a minimum threshold value of 2.0 for Y. pestis responses and a maximum threshold value of 1.0 for Y. pseudotuberculosis responses, while 20 probes could be selected using a minimum threshold of 2.0 for Y. pestis and allowing a maximum threshold of 0.5 for Y. pseudotuberculosis. In each case, a sufficiently large number of probes would allow for detection redundancy.
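A threshold rule of this kind is simple to express in code. The sketch below is illustrative only: the probe names and response values are made up, and whether the thresholds are inclusive is an assumption, as the text does not say.

```python
def select_probes(responses, min_target=2.0, max_non_target=1.0):
    """Return IDs of probes with target response >= min_target and
    non-target response <= max_non_target (inclusive bounds assumed)."""
    return [
        probe_id
        for probe_id, (target, non_target) in responses.items()
        if target >= min_target and non_target <= max_non_target
    ]


# Hypothetical normalized responses: probe ID -> (target, non-target)
responses = {
    "probe_a": (3.1, 0.4),
    "probe_b": (2.4, 0.8),
    "probe_c": (2.9, 1.6),
    "probe_d": (1.2, 0.2),
}

print(select_probes(responses))                      # ['probe_a', 'probe_b']
print(select_probes(responses, max_non_target=0.5))  # ['probe_a']
```

Tightening the non-target threshold shrinks the selected set, in the same way that the count drops from 25 to 20 probes in the example above.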

Other applications of TOFI

Having a near-neighbor genome is not a requirement for TOFI. The near-neighbor genome is used to reduce the solution search space as much as possible in the first step, which is computationally the least expensive step. The target genome could be compared with any small set of genomes using suffix trees. The higher the number of matches identified in the first step is, the lower is the number of computations required in the subsequent steps. In the current study, a single near-neighbor comparison reduced the search space very effectively. In the case in which a closely related near-neighbor for the target is unknown, either arbitrary genome(s) could be used as near-neighbor(s) or the first step could be omitted. In fact, TOFI was successfully used to identify DNA fingerprints for plasmids pPCP1, pCD1 and pMT1 in Y. pestis without using any near-neighbor. Because plasmids are much shorter (about a few thousand bases) than bacterial genomes (typically over a few million bases), the whole plasmid could be considered as a single candidate sequence and sent directly as input to the DNA probe design step.

TOFI was also used to identify fingerprints of Francisella tularensis strain SCHU S4 (accession no. NC_006570.1). The genomic sequence of the near-neighbor Francisella philomiragia was not available and, therefore, the first step of TOFI was omitted. In the second step, we used OMP to scan the whole F. tularensis genome, consisting of about 1.9 million bases. Overlap of adjacent probes was limited to 10 bases to reduce computation time. OMP identified about 20 000 probes, which were tested for specificity with T = 15, resulting in 250 fingerprints. Further screening for restriction enzyme cleavage sites reduced the number of in silico fingerprints to 121.
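The effect of limiting the overlap of adjacent probes can be illustrated with a fixed-length sliding window. This is a deliberate simplification and not a description of how OMP works: the fixed probe length of 40 bases, the function name and the toy sequence are all assumptions made for the example.

```python
def tile_candidates(sequence: str, probe_len: int = 40, max_overlap: int = 10):
    """Yield (start, window) pairs whose neighbours share at most max_overlap bases."""
    if not 0 <= max_overlap < probe_len:
        raise ValueError("expected 0 <= max_overlap < probe_len")
    step = probe_len - max_overlap  # larger step means fewer, less redundant windows
    for start in range(0, len(sequence) - probe_len + 1, step):
        yield start, sequence[start:start + probe_len]


toy_genome = "ACGT" * 25  # 100 bases, for illustration only
starts = [start for start, _ in tile_candidates(toy_genome)]
print(starts)  # [0, 30, 60]
```

With an overlap limit of 10, each window starts 30 bases after the previous one, so the number of candidates scales with roughly one thirtieth of the sequence length rather than with its full length.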

Four chips were fabricated using several replicas of the 121 in silico fingerprints and a number of control probes. Two chips were used to test hybridization with F. tularensis and the other two to test cross-hybridization with F. philomiragia. We performed initial evaluation using a criterion similar to the one employed for Y. pestis. In contrast to the Y. pestis hybridization experiments, only one probe had higher average response with F. philomiragia than with F. tularensis. In the experiment, 85 probes showed a normalized response with F. philomiragia smaller than 1.0, while 81 of those probes had responses larger than 2.0 with F. tularensis. Currently, a large number of additional experiments, including a standard panel of non-target genomes, are being performed to evaluate the fingerprints of Y. pestis and F. tularensis before they are used as probes in diagnostic assays.

Current limitations and plans for improvements

TOFI has already been used in its current configuration to identify fingerprints for a number of pathogens. However, several algorithmic and implementation issues affect its performance and are being addressed.

The scope of DNA fingerprint identification in TOFI is currently limited to a single target sequence. We are investigating approaches to select fingerprints common to a large number of related targets, which would allow for the identification of fingerprints common to specific species or genus. For highly variable RNA viruses, unique fingerprints may not exist. For this application, an approach based on the selection of non-unique probes, which together may form unique hybridization patterns on a chip for unambiguous viral identification, is also being considered.
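The idea of non-unique probes that jointly form a unique pattern can be made concrete with a small check. The sketch below is not part of TOFI; it assumes that the expected hybridization of each target is summarized as a tuple of 0/1 values, one per probe, and the target names and patterns are hypothetical.

```python
from collections import defaultdict


def find_ambiguous_targets(expected):
    """Group targets that share an identical expected hybridization pattern.

    expected maps a target name to a tuple of 0/1 values (one per probe).
    An empty result means every target has a distinct pattern.
    """
    groups = defaultdict(list)
    for target, pattern in expected.items():
        groups[tuple(pattern)].append(target)
    return [names for names in groups.values() if len(names) > 1]


# Three probes, each hybridizing to two targets, so no probe is unique,
# yet the three patterns are all different.
expected = {
    "virus_A": (1, 1, 0),
    "virus_B": (0, 1, 1),
    "virus_C": (1, 0, 1),
}
print(find_ambiguous_targets(expected))  # []

expected["virus_D"] = (1, 1, 0)  # same pattern as virus_A
print(find_ambiguous_targets(expected))  # [['virus_A', 'virus_D']]
```

A real design would also have to tolerate noisy responses, which this binary comparison ignores.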

Although we have provided a definition of a DNA fingerprint and an algorithm that guarantees all fingerprints satisfying it are identified, valid probes could potentially be discarded due to significant overlap with adjacent probes during the probe design phase (Step 2 of TOFI). This relates to practical considerations in order to reduce the number of ‘similar’ probes on the chip, allow space for multiple replicates, and limit the total number of probes, considering that several control sequences need to be present on the microarray.

Experimental evaluations of the identified in silico fingerprints for Y. pestis and F. tularensis indicate the possibility for improvement in the algorithm specificity. It was found that cross-hybridization with non-target genomes was not detected by BLAST in about 10% of the in silico fingerprints for Y. pestis, while in an additional 10% of the fingerprints the cross-hybridization was underestimated. Specificity will be improved in the future based on: (1) the selection of optimal TOFI parameters using comprehensive evaluation of the experimental results; (2) the post-processing of BLAST alignments using expert rules to better correlate alignments with hybridization and (3) the development of optimal alignment algorithms that include hybridization thermodynamics as a post-processing step after the BLAST specificity evaluation.

Due to the extremely rapid growth/modifications in available DNA sequences, the continued validity of the DNA fingerprints must be frequently verified. Efforts to automatically update fingerprints are also planned.

CONCLUSIONS

This work presented TOFI, an integrated bioinformatics tool to identify in silico genomic fingerprints for the design of microarray diagnostic assays. TOFI is a standalone application that exploits the parallel programming benefits provided by HPC platforms and allows users to select input parameters through a graphical user interface. This work differs from previous ones in that a formal definition of a DNA fingerprint is provided. More importantly, given the desired length of a fingerprint and its required number of non-matching base pairs, we provide an algorithm that guarantees that all in silico fingerprints are identified. Fingerprints for a number of pathogenic sequences have been preliminarily evaluated through experimental tests with pathogens of interest and non-target genomes. Initial results indicate that the approach is capable of identifying multiple fingerprints for specific DNA sequences, and that the algorithm could be improved to enhance specificity. Further testing, with a standard panel of non-target genomes, is underway. This information will enable optimal TOFI parameter selection and will serve as a valuable benchmark for future algorithm improvements.

DISCLAIMER

The opinions or assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U.S. Army or the U.S. Department of Defense.

ACKNOWLEDGEMENTS

The authors wish to express their gratitude to the anonymous referees for useful comments and suggestions as well as interesting ideas for future work. The authors thank Kamal Kumar of the Biotechnology HPC Software Applications Institute for help in developing TOFI’s graphical user interface and Bob Stephens, Jack Collins and Karol Miaskiewicz of the Advanced Biomedical Computing Center, National Cancer Institute, Frederick, MD, for the computational support. This work was sponsored by the U.S. Department of Defense High-Performance Computing Modernization Program (HPCMP), under the High-Performance Computing Software Applications Institutes (HSAI) initiative, and the U.S. Defense Threat Reduction Agency.

Conflict of Interest: none declared.

REFERENCES

Abee, T. et al. (2004) Impact of genomics on microbial food safety. Trends Biotechnol., 22, 653–660.

Altschul, S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Benson, D.A. et al. (2005) GenBank. Nucleic Acids Res., 34, D16–D20.

Chain, P.S. et al. (2004) Insights into the evolution of Yersinia pestis through whole-genome comparison with Yersinia pseudotuberculosis. Proc. Natl Acad. Sci. USA, 101, 13826–13831.

Darling, A. et al. (2003) The design, implementation, and evaluation of mpiBLAST. In 4th International Conference on Linux Clusters: The HPC Revolution 2003, in conjunction with the ClusterWorld Conference & Expo, San Jose, CA.

Draghici, S. et al. (2005) Identification of genomic signatures for the design of assays for the detection and monitoring of anthrax threats. In Altman, R.B. et al. (ed.), Proceedings of the Pacific Symposium of Biocomputing 2005 Hawaii, USA, pp. 248–259.

Gardner, S. et al. (2004) Sequencing needs for viral diagnostics. J. Clin. Microbiol., 42, 5472–5476.

Gordon, P.M.K. and Sensen, C.W. (2004) Osprey: a comprehensive tool employing novel methods for the design of oligonucleotides for DNA sequencing and microarrays. Nucleic Acids Res., 32, e133.

Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.

Hardiman, G. (ed.) (2003) Microarray Methods and Applications - Nuts and Bolts. DNA Press, Eagleville, PA, USA.

Hass, S.A. et al. (2003) Genome-scale design of PCR primers and long oligomers for DNA microarrays. Nucleic Acids Res., 31, 5576–5581.

Ivnitski, D. et al. (2003) Nucleic acid approaches for detection and identification of biological warfare and infectious disease agents. Biotechniques, 35, 862–869.

Joos, T. and Fortina, P. (2005) Microarrays in clinical diagnosis. Humana Press, Totowa, NJ, USA.

Kaderali, L. and Schliep, A. (2002) Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics, 18, 1340–1349.

Kurtz, S. (2002) Construction and application of virtual suffix trees. PhD dissertation, Technische Fakultät, Universität Bielefeld, Bielefeld, Germany.

Kurtz, S. et al. (2004) Versatile and open software for comparing large genomes. BMC Genome Biol., 5, R12.

Leber, M. et al. (2005) A fractional programming approach to efficient DNA melting temperature calculation. Bioinformatics, 21, 2375–2382.

Lin, H. et al. (2005) Efficient data access for parallel BLAST. IEEE International Parallel and Distributed Processing Symposium, Denver, CO.

Nordberg, E. (2005) YODA: selecting signature oligonucleotides. Bioinformatics, 21, 1365–1370.

Panjkovich, A. and Melo, F. (2005) Comparison of different melting temperature calculation methods for short DNA sequences. Bioinformatics, 21, 711–722.

Pruitt, K.D. et al. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.

Rahmann, S. (2003) Fast large scale oligonucleotide selection using the longest common factor approach. J. Bioinfo. Compu. Biol., 1, 343–361.

SantaLucia, J., Jr and Hicks, D. (2004) The thermodynamics of DNA structural motifs. Annu. Rev. Biophys. Biomol. Struct., 33, 415–440.

Schliep, A. et al. (2003) Group testing with DNA chips: generating designs and decoding experiments. In Proceedings of the Computational Systems Bioinformatics, August 11-14, Stanford, CA, pp. 84–91.

Sergeev, N. et al. (2006) Microarray analysis of Bacillus cereus group virulence factors. J. Microbiol. Meth., 65, 488–502.

Slezak, T. et al. (2003) Comparative genomics tools applied to bioterrorism defense. Brief. Bioinform., 4, 133–149.

Urisman, A. et al. (2005) E-Predict: a computational strategy for species identification based on observed DNA microarray hybridization patterns. BMC Genome Biol., 6, R78.

Viljoen, G.J. et al. (eds) (2005) Molecular Diagnostics PCR Handbook. Springer Publishers, Berlin, Germany.

Wang, D. et al. (2002) Microarray-based detection and genotyping of viral pathogens. Proc. Natl Acad. Sci. USA, 99, 15687–15692.

Weiner, P. (1973) Linear pattern matching algorithms. In Proceedings of 14th IEEE Annual Symposium on Switching and Automata Theory, Washington, DC, IEEE Computer Soc., pp. 1–11.

Willse, A. et al. (2004) Quantitative oligonucleotide microarray fingerprinting of Salmonella enterica isolates. Nucleic Acids Res., 32, 1848–1856.
