Sequenceanalysis
ORIGINALPAPER
Vol.23no.12007,pages5–13doi:10.1093/bioinformatics/btl549
Oligonucleotidefingerprintidentificationformicroarray-basedpathogendiagnosticassays
WaibhavTembe,NelaZavaljevski,ElizabethBode1,CatherineChase1,JeanneGeyer1,LeonardWasieloski1,GaryBenson2andJaquesReifmanÃBiotechnologyHPCSoftwareApplicationsInstitute,TelemedicineandAdvancedTechnologyResearchCenter,USArmyMedicalResearchandMaterielCommand,Ft.Detrick,MD,1DiagnosticSystemsDivision,USArmyMedicalResearchInstituteofInfectiousDiseases,Ft.Detrick,MDand2DepartmentsofBiologyandComputerScience,BostonUniversity,Boston,MA,USA
ReceivedonJune15,2006;revisedonOctober18,2006;acceptedonOctober21,2006AdvanceAccesspublicationOctober26,2006AssociateEditor:JohnQuackenbush
ABSTRACT
Motivation:AdvancesinDNAmicroarraytechnologyandcomputa-tionalmethodshaveunlockednewopportunitiestoidentify‘DNAfin-gerprints’,i.e.oligonucleotidesequencesthatuniquelyidentifyaspecificgenome.Wepresentanintegratedapproachforthecomputa-tionalidentificationofDNAfingerprintsfordesignofmicroarray-basedpathogendiagnosticassays.WeprovideaquantifiabledefinitionofaDNAfingerprintstatedbothfromacomputationalaswellasanexperimentalpointofview,andtheanalyticalproofthatallinsilicofingerprintssatisfyingthestateddefinitionarefoundusingourapproach.Results:Thepresentedcomputationalapproachisimplementedinanintegratedhigh-performancecomputing(HPC)softwaretoolforoligonucleotidefingerprintidentificationtermedTOFI.WeemployedTOFItoidentifyinsilicoDNAfingerprintsforseveralbacteriaandplas-midsequences,whichwerethenexperimentallyevaluatedaspotentialprobesformicroarray-baseddiagnosticassays.Resultsandanalysisofapproximately150insilicoDNAfingerprintsforYersiniapestisand250fingerprintsforFrancisellatularensisarepresented.
Availability:Theimplementedalgorithmisavailableuponrequest.Contact:jaques.reifman@us.army.mil.
INTRODUCTION
Therecentadvancesingenomicsequencingandtheavailabilityoflarge-scalesequencedatabaseshaveunlockedseveralopportunitiestoidentify‘genomicsignatures’or‘DNAfingerprints’,i.e.shortDNAsequencesthatuniquelyascertainthepresenceorabsenceofcausativebiologicalagents,suchasviruses,bacteriaorvirulentgenes.Forexample,avastnumberofDNA-baseddetectionanddiagnostictechnologiesarebeingdevelopedtoquicklyidentifybiologicalthreatagents(Ivnitskietal.,2003;Slezaketal.,2003;Draghicietal.,2005;KaderaliandSchliep,2002),suchastheanthrax-causingbacterium,Bacillusanthracis,andtheplague-causingbacterium,Yersiniapestis.DNAsignaturescouldalsobeusedtodetectthepresenceofoneormorevirulentgenes,suchasBacillusgenes,whichencodeimportantvirulencefactors,entereotoxinsandexotoxins(Sergeevetal.,2006),andtoprovide
ÃTowhomcorrespondenceshouldbeaddressed.
high-resolutiondifferentiationbetweencloselyrelatedmicroorgan-ismsinmicrobialforensics(Willseetal.,2004).Newvirusesandstrainshavebeenidentifiedusingaspecialmicroarraytechnologyconsistingofapproximately1100070meroligonucleotides(Wangetal.,2002).DNAfingerprintshavealsobeenusedtodevelopdiagnosticassaysforawide-rangeofimportantapplicationsinmedicine,environmentalmonitoringandqualitycontroloffoodproducts(Hardiman,2003;JoosandFortina,2005;Wangetal.,2002;Abbeetal.,2004).
ThespecificalgorithmimplementedinaDNAfingerprintidenti-ficationmethodisselectedbasedon(1)whethertheDNAfinger-printsarebeingsoughtforaspecificpathogenstrain(e.g.Y.pestisCO92),agroupofpathogensfromthesamespecies(e.g.allY.pestisstrains)orgenus(e.g.allYersiniaspecies),orasetoforganismsthatmayormaynothaveanyphylogeneticrelationship(e.g.todetectaviralorabacterialfamily)and(2)theexperimentalconditionsspecifiedbytheendapplicationtechnology,suchasPCR(Slezaketal.,2003;Viljoenetal.,2005;Haasetal.,2003;GordonandSensen,2004)orDNAmicroarrays(KaderaliandSchliep,2002;Hardiman,2003;Rahmann,2003;Leberetal.,2005;Nordberg,2005).Theuseofreal-timePCR-baseddetectiontechnologyrequirestheidentificationofthreeinformativesequences:twoamplificationprimersequencesandanadditionalprobesequence(thefinger-print).Theassayrequiresthatprimerhybridizationtakesplacenearthefingerprintand,therefore,imposesconstraintsontheposi-tionoftheprimerandPCR-basedfingerprints.Moreover,PCR-basedassaysarequitelimitedintheirmultiplexingcapabilities,asdifferentassaysarerequiredtodetectdifferentpathogenicsequences.Incontrast,microarraysdonotimposeanypositionspecificconstraintsontheDNAfingerprints,andseveralfinger-printscanbesimultaneouslyplacedonamicroarraytoprovidedetectionredundancyandallowforthediagnosisofmultiplepatho-gensonasingleassay.Despitetheseadvantages,microarray-basedassaysarerelativelyinsensitiveandslowcomparedtotheexquisitesensitivityandspeedofPCR-basedassays.Microarraysensitivitycanbegreatlyenhancedbyincorporatingsampleamplificationpriortohybridizationbut,unfortunately,thisresultsinanetincreaseinassaytimeforalreadyslowassays.
ThispaperisconcernedwiththeidentificationofDNAfinger-printsforspecific,singlepathogenicsequences,referredtoasthe
ÓTheAuthor2006.PublishedbyOxfordUniversityPress.Allrightsreserved.ForPermissions,pleaseemail:journals.permissions@oxfordjournals.org
5
W.Tembeetal.
target,forthedesignofDNAmicroarray-baseddiagnosticassays.Thetargetcouldbeanentiregenome(e.g.B.anthracisAmes),achromosome(e.g.Brucellamelitensisbiovarabortus2308chromo-someII)oranon-chromosomesequence(e.g.B.anthracisplasmidpXO2).Amoregeneralproblem,notaddressedhere,involvestheidentificationofDNAfingerprintscommontomultiplestrainsormultiplespecies.AneffectiveapproachforthisproblemistousemultiplegenomealignmentandsearchforconservedregionstoidentifycommonDNAsignatures(Slezaketal.,2003).Identifica-tionofcommonDNAsignaturesbecomesanevengreaterproblemforhighlyvariableRNAviruses(Gardneretal.,2004),whereapromisingsolutionistoselectcombinationsofnon-uniqueprobesanduseuniquehybridizationpatternstounambiguouslyidentifyspecificviralstrains(Urismanetal.,2005;Schliepetal.,2003).Giventhelonglengthofmosttargets,theidentificationofDNAfingerprintsisaproblemofhighcomputationalcomplexity.Thepotentialsolutionspaceisextremelylargebecauseeverysubse-quenceofthetargetsequenceneedstobeconsidered.Furthermore,thedeterminationofuniquenessrequirescomparisonwithnucleot-idedatabases,suchastheGenBank(Bensonetal.,2005),thataregrowingexponentiallyinsize.Moreover,uniquenessofDNAfin-gerprintsobtainedusingsuchcomparativealgorithmsisonlyvalidwithrespecttothereferencedatabaseused.Asnewsequencesaremadeavailable,previouslyidentifiedfingerprintsneedtoberevalidated.
SeveralpracticalchallengesariseintheexperimentalevaluationofthecomputationallyidentifiedDNAfingerprintsformicroarray-basedassays.TheDNAfingerprintsshouldproduceahighresponsewhenhybridizedwithasamplecontainingthetargetgenome.Con-versely,theresponseforanynon-targetgenomeshouldbeaslowaspossible.Thus,algorithmsforDNAfingerprintidentificationmustincludeexperimentalconstraintsandDNA–DNAhybridiza-tionmodelingmethodstopredicttheresponseonamicroarray.AlthoughresearchinmodelingmolecularlevelinteractionsbetweenDNAsequenceshasmadesignificantprogress,thereis,unfortunately,noanalyticalmethodavailabletodaythatcanpredicttheexactoutcomeofahybridizationreactionbetweentwoormorearbitraryDNAsequences(SantaLuciaandHicks,2004;Nordberg,2005).Moreover,duetothevariabilityintheoutcomeofahybrid-izationexperiment,alargenumberofrepetitionsarerequiredtoexperimentallyevaluatetheDNAfingerprints.Itmightnotbepos-sibletoexperimentallytestallthecomputationallyidentifiedfin-gerprintsbecauseoftheassociatedcostsandlimitedresources.Tosimultaneouslyaccommodatethesecomputationalneedsandexperimentalconstraints,DNAfingerprintidentificationtools(KaderaliandSchliep,2002;Rahmann,2003;Leberetal.,2005)integratecomputationalalgorithmsforidentifyinguniquesequencesandDNAhybridizationmodelingtoolsforpredictingtheoutcomeofthemicroarrayexperiment.Often,thesetoolsapplyvariousapproximationstoreducethecomputationalcomplex-ity.Forexample,theintegratedapproachofKaderaliandSchliep(2002)usesanefficientsearchalgorithmbasedonsuffixtreesandasimplifiedtwo-statetransitionnear-neighborthermodynamicmodelforDNAprobedesignandcross-hybridization.Thesimpli-fiedmodelreducesthecomputationaltimebutintroducesmodelingerrorsintheDNAfingerprintdesign.Asimilarthermodynamicmodelwithacomputationallymoreefficientapproachwaspro-posedbyRahmann(2003).Anefficientfractionalprogrammingapproachformelting-temperaturecomputationwithanimproved6
two-statetransitionnear-neighborthermodynamicmodelhasalsobeenproposed(Leberetal.,2005),however,computationaltimewouldstillbeanissueforcross-hybridizationevaluationofalargenumberofnon-targetgenomes.
ComputationalandexperimentalfactorsmakequantificationoftheuniquenessorspecificityofashortDNAsequencechallenging.Infact,aliteraturesurveyindicatesthattherelatedstudieshavenotstatedaprecise,quantitativedefinitionofaDNAfingerprint.Althoughthegeneralideaistosearchthetargetgenomefor‘unique’DNAsequencesandthentestthemexperimentally,theinsilicocriterionforuniquenesshasnotbeenexplicitlystated.Inthispaper,wefirstprovideaformaldefinitionofaDNAfingerprintbasedonvariousexperimentalconditionsandaspecificitycriterion.Wethendescribeanintegratedapproachthatcombinesefficientbioinformaticsalgorithms,takesintoaccountexperimentalcon-straints,andincludesalarge-scalecomparisonofDNAfingerprintswithnucleotidedatabases.Next,wedescribethealgorithmunder-lyingTOFI(toolforoligonucleotidefingerprintidentification),itssoftwareimplementationonahigh-performancecomputing(HPC)platform,andananalyticalapproachtochoosetheinputparametersofTOFI,whichguaranteesthatallpossibleDNAfingerprintssat-isfyingthestateddefinitionareobtained.Finally,wediscussinitialexperimentalresults,whichhelpevaluateourdefinitionofaDNAfingerprintandtheassociatedspecificitycriterion.
TERMINOLOGYANDPROBLEMDEFINITION
ADNAfingerprintforagiventargetgenomegtisdefinedwithrespecttoareferencenucleotidesequencedatabasedenotedbyG¼{g1,g2,...,gn,...,gN}thatcontainsNsequences.Inpractice,GconsistsofDNAsequencesfromoneormorepubliclyavailablecomprehensivedatabases,suchasGenBank,oranyothersmallernucleotidedatabase,suchasaviralDNAsequencedatabase.ThetargetgenomemayormaynotbelongtoG,implyingthatitcouldbeaknownpathogenoranewlysequencedone.ImplicitinthedefinitionofaDNAfingerprintisitsvaliditywithrespecttotheavailabledatabase.AsnewerDNAsequencesbecomeavailableandareaddedtoG,itisnecessarytoverifythevalidityofthepreviouslyidentifiedfingerprints.
Basedontheschoolofthought,computationalorbiological,thedefinitionofaDNAfingerprintvaries.Therefore,somediscussionaboutournotionofaDNAfingerprintisinorder.
Fromapurecomputersciencestandpoint,aDNAfingerprintofgtcouldbedefinedas‘anysubsequenceofgtthatisnotasubse-quenceofanygn2G,n¼t’.Bythisdefinition,theproblemofidentifyingDNAfingerprintsisequivalenttotheclassicstringcomparisonproblemofidentifyingsubstringsofgtthatdonotexactlymatchanysubstringofanygn2G,n¼t.Althoughmathe-maticallycorrect,thisdefinitionlackstheapplication-specificrequirements.DNAfingerprintshavetosatisfy:(1)designcon-straints,sothattheycanbeusedasDNAprobesonmicroarraysand(2)specificityconstraints,sothattheycandiscriminate,inamicroarrayhybridizationreaction,betweentargetandnon-targetsequences.DNAfingerprintsthatsimultaneouslysatisfybothdesignandspecificityconstraintsrequireabiologicallymoresounddefinition.Wemathematicallyformalizetheexperimentalandspecificityconstraintsasfollows.
LetKdenotetheDNAmicroarrayexperimentalconstraints,suchastheminimumandmaximumlengthoftheDNAfingerprint,the
Oligonucleotidefingerprintidentification
hybridizationmelting-temperature,GCcontent,etc.(SantaLuciaandHicks,2004),andletP¼{p1,p2,...,pi,...,pI}denotethesetofallsubsequencesofgtthatsatisfyK.Thus,bydefinition,everypi2Pwillhavelengthwithinthespecifiedminimum(Lmin)andmaximum(Lmax)bounds,GCcontentwithintherequiredrange,andwillsatisfyseveralotherpropertiesspecifiedbythechosenDNAhybridizationmodelingmethodology.WerefertothesequencesinPasDNAprobes.NotethatconstraintsdenotedbyKdonotspecifyattributesregardingtheuniquenessofaDNAprobewithrespecttonon-targetsequences.
QuantifyingspecificityofDNAfingerprintsfromanexperimentalpointofviewisverysubjective.ItisbasedoninterpretingtheexperimentalhybridizationresultsbetweenDNAprobesandnon-targetDNAsequences.Sincethereisalackofaccurateinsilicohybridizationmodels,weinferthespecificityofaDNAprobebyfirstcomputingDNAsequencealignments,andthendeterminingifthealignedprobemeetsanempiricalthresholdT.ThisisbasedonthehypothesisthatDNAsequencesthatalignpoorlyareunlikelytoformastableDNA–DNAduplexforagivensetofexperimentalconstraints.ThishypothesisimpliesthattheDNAsequencealignment,calculatedstrictlyusingcom-putationaltools,providesquantification,throughthethresholdT,oftheactualstrengthoftheDNA–DNAduplex.Thus,wecomputethespecificityofaDNAprobefromthenumberofmismatches,gapsandinsertions/deletionsinthealignmentandcompareitwiththethresholdT,representingthesetUofallspecificityconstraints.Havingformalizedtheexperimentalandspecificityconstraints,wedefineaDNAfingerprintandtheproblemofidentifyingallDNAfingerprintsforatargetgenomeasfollows:
Definition(DNAfingerprint):ADNAprobepioflengthLiisconsideredaDNAfingerprintofgtifandonlyifanoptimalsequencealignmentbetweenpiandanyothersequencegn2G,n¼t,hasatmostLi-Tmatches.
Definition(DNAfingerprintidentification):ForatargetDNAsequencegt,findallinsilicoDNAfingerprintsthatsatisfytheexperimentalconstraintsKandspecificityconstraintsUwithrespecttoareferenceDNAsequencedatabaseG.
LetS¼{s1,s2,...,sf,...,sF}beasubsetofP,i.e.SP,thatdenotesthesetofDNAprobesthatsatisfybothconstraintsKandU.OurgoalistofindallFelementsofS.WerefertotheelementsofSasinsilicoDNAfingerprintsbecausetheysatisfyallconstraintsthathavebeenquantifiedforcomputationalpurposes.Theirexperi-mentalvalidityneedstobetestedinanactualDNAmicroarrayexperiment.Unlessstatedotherwise,henceforththeterm‘DNAfingerprint’impliesinsilicoDNAfingerprint,whichisvalidwithrespecttoareferencedatabaseused.
Fig.1.TheThreeStepsoftheTOFIAlgorithm.
database.Auser-definedspecificityconstraintUisusedtointerpretthealignmentsfromthestandpointofcross-hybridizationbetweentheDNAprobesandnon-targetgenomes.DNAprobessatisfyingUarereportedasDNAfingerprintsandaretestedonmicroarrays.
Althoughtheproblemhasbeensplitintothreediscretestepsforclarityofexplanation,theindividualstepsarenotcompletelyinde-pendentfromanalgorithmicstandpoint.Infact,ourobjective,toobtainallDNAfingerprints,leadsustoconcludethattheinputparametersinthefirststephaveananalyticalrelationshipwiththeconstraintsimposedinthesecondandthethirdsteps.Wediscusstheinterdependenceofthethreestepsafteranin-depthdescriptionofeachofthethreesteps.
Step1:solutionspacereduction
Thesolutionspacetobesearchedisextremelylargebecauseeverysubsequenceofthegiventargetmustbeconsidered.Testingeachsubsequenceexperimentallyisimpracticalandexpensive.Butreducingthesolutionspacecomputationallycanbequickandcheaper.Forthispurpose,weexploitthesequencesimilaritiesbetweenthetargetgenomeandanevolutionarynear-neighbor(gr)thatcanbeidentifiedfromaphylogenetictreeorpublisheddata.ThetargetandneighborwillcontaincommonDNAsequences,which,obviously,cannotbeusedasDNAfingerprints.DNAsequencescommontobothgtandgrareextractedusingsuffixtrees(Wiener,1973;Gusfield,1997),which,withinthedomainofcomparativegenomics,havebeenusedinVmatch(Kurtz,2002)andtheMaximalUniqueMatcher(MUMmer)(Kurtzetal.,2004)toidentifyrepeats,exactorapproximatematches,andsinglenucleotidepolymorphisms.Detailsoftheconstruction,traversalandnumerousapplicationsofsuffixtreesinseveraldif-ferentstring-matchingapplicationsareavailablein(Gusfield,1997).Itshouldbenotedthatthesolutionspacecouldbefurtherreducedbycomparingthetargetgenomewithmultiplenon-targetgenomes,asdescribedbySlezakandcolleagues(Slezaketal.,2003).ThiscouldbedonewithinTOFI’scurrentalgorithmiccon-figurationbyconcatenatingstringsfrommultiplenon-targetgen-omesintoonelongstringandprovidingitasthenear-neighborgenomeforcomparison.Analternate,perhapsmoreefficient
7
INTEGRATEDAPPROACH
TOFIimplementsamulti-stepapproachthatbreaksdowntheprob-lemoffingerprintidentificationintothethreestepsillustratedinFigure1.ThefirststepreducesthesolutionspacebydiscardingDNAsequencescommontoboththetargetsequenceandoneormorebiologicalnear-neighborsequences.Thesurvivingsequencesaretermedcandidatesequences.Inthesecondstep,amicroarrayDNAprobedesignphaseextractsfromcandidatesequencesonlythosesubsequencesthatsatisfytheapplication-specificexperi-mentalconstraintsK.Inthethirdstep,eachDNAprobeisalignedwithallDNAsequencespresentinthechosenreferencenucleotide
W.Tembeetal.
Table1.TypicalvaluesforDNAprobedesignconstraintsKLength(bases)
Fig.2.OutputoftheSuffix-Tree-BasedAlgorithm.
Minimum,Lmin:35Maximum,Lmax:40
Melting-temperature(C)Minimum,Tmin:70Maximum,Tmax:75
GCcontent(%)45–50
approach,whichwouldrequiresoftwaremodification,istocomparethetargetsequentiallyagainstalistofnon-targetsequences,sothataftereachcomparisononlyunmatchedsequencesarecomparedwiththesubsequentnon-targetgenomesfromthelist.
Oncetheexactmatchesbetweenthetargetgtanditsnear-neighborgrareidentified,thetargetcanberepresentedasacon-catenationgt¼C0M1C1M2C2...MJCJ,asshowninFigure2.MjdenotesthejthexactmatchoflengthjMjjandCjdenotesthejthcandidatesequence,i.e.asequenceinthetargetthatcontainsnomatchesoflengthMorlongerwiththenear-neighbor.C0and/orCJcanbenullbasedonwhetherornotthereisanexactmatchatthebeginningortheendofgt,respectively.ExactmatchesthatarelongerthantheminimumlengthMarediscardedandonlythecandidatesequencesareretainedforfurtherconsideration.Thecandidatesequenceshavenorestrictionwithrespecttotheirlength,positioninthegenome,orcompositionofbasepairs.Ofparticularinteresttoourapplicationisthechoiceofinputparameterstothesuffix-tree-basedalgorithm,inparticular,theminimumlengthMofexactmatchesbetweengtandgrthatwouldleadtoidentificationofallDNAfingerprints.OuranalysisindicatesthattheparameterMiscloselyrelatedtotheexperimentalandspecificityconstraints,detailedinSteps2and3,respectively.Therefore,wefirstdescribetheremainingtwostepsofTOFIbeforeananalyticalrelationisderivedbetweentheparameterMandtheconstraintsimposedbytheproblemdefinition.
ThisselectionofMdiffersfromarelatedstudy(Slezaketal.,2003),whereM¼18washeuristicallyselectedtomeettheminimalPCRprimersize.
Step3:specificitydeterminationbysequencealignment
Inthethirdstep,everyDNAprobeisalignedwithsequencesinthereferencenucleotidedatabase.Theresultsofthealignmentsareinterpretedtopredictcross-hybridizationusingthefollowinggeneralrule:ADNAprobethatalignspoorlywithallnon-targetsDNAsequencesisunlikelytocross-hybridizewithnon-targetsand,therefore,shouldbeconsideredasafingerprint.
DuetothelimitationsofDNA–DNAhybridizationmodels,deter-miningthealignmentcorrespondingtotheoptimalDNA–DNAduplexonamicroarrayishard.Computationally,optimalalignmentbetweentwoDNAsequencescouldbedefinedusingthegeneralizededitdistancealgorithm(Gusfield,1997).Simplyput,theeditdis-tancebetweentwosequencescorrespondstothetotalnumberofinsertions,deletionsandsubstitutionsthatareneededtotransformonesequenceintotheother.FromthestandpointofDNAcross-hybridization,asubstitutioncorrespondstoamismatchedpairofnucleotidesandinsertions/deletionscorrespondtogapsintheDNA–DNAduplex.Thelowerthenumberofmismatchesandgapsinthealignmentisthelowerintheeditdistance.However,editdistancedoesnotprovidesufficientinformationwithregardstothestrengthofhybridization.Forexample,itdoesnotconsiderthepositionofmatchesinthealignment,GCcontent,gaplengthandthelongestcommonfactorinthealignment(Rahmann,2003).Moreover,computingtheoptimalalignmentbetweeneachDNAprobeandeveryDNAsequenceinalargedatabase,suchasthe‘nt’nucleotidedatabasefromtheNationalCenterforBiotechnologyInformation(NCBI,http://www.ncbi.nlm.nih.gov/)(Pruittetal.,2005),wouldbeverycomputationallyintensive.
Basedontheseissuesandpracticaltimeconstraints,weoptedtousetheBLASTNprogramfromBLAST(Altschuletal.,1990)foraligningDNAprobeswithareferencedatabase.ThealignmentalgorithminBLASTisaheuristics-basedapproachthatstartsoffbyidentifyingawordofexactmatchofagivenlength(wparameter)andproceedsbyextendingitusingdynamicprogrammingtoallowmismatchesandgapsinthealignment.AstatisticalsignificancescoretermedE-valueisusedtodistinguishbetweenpotentiallymeaningfulalignmentsandchancealignments.TheE-valuescorewasusedin(Draghicietal.,2005)toquantifythespecificityofDNAprobes.However,E-valuesaredeterminedbythelengthofthealignment,sizeofthequery,sizeofthetotaldatabaseandseveralotherparametersthatarenotrelatedtotheabilityofaprobetoformacross-hybridwithanon-targetgenome.
InsteadofusingE-valuealonetodetermineprobespecificity,TOFIexaminestheactualalignmentsreportedbyBLASTanddeterminesthespecificityofaprobebytakingintoaccountthenumberofmatches,mismatchesandgapsinthealignmentinde-pendentofitsstatisticalsignificance.ThesespecificityconstraintsUformthebasisfortheempiricalthresholdT,usedinthefollowing
Step2:microarrayprobedesign
ThesecondstepimposesasetofexperimentalconstraintsKtoextractDNAmicroarrayprobesfromthecandidatesequences.Arecentreview(PanjkovichandMelo,2005)indicatesthatforthesameinputDNAsequencesdifferentinsilicoprobedesignmod-elingtools,notsurprisingly,producedifferentsetsofDNAprobes.Toourknowledge,thereisnouniversallyacceptedmodelingmeth-odologyavailabletodaytodesignmicroarrayprobesfromDNAsequences.Often,thesetoolsareusedinaniterative,trial-and-errorfashiontooptimizethequality/numberofoutputDNAprobestosuitetheapplication-specificneeds.
Wehaveselectedaprobedesigntoolthatimplementsamulti-statethermodynamicmodelformelting-temperature(SantaLuciaandHicks,2004).ThemodelallowsfortherepresentationofseveraldozenconstraintsontheDNAprobes,suchasprobelength,GCcontent,molarconcentrations,self-hybridizationpossibilitiesandlimitonthenumberofsinglenucleotiderepeats.AdditionalinformationontheconstraintsKcanbefoundinSantaLuciaandHicks(2004).Asanexample,onlyafewimportantconstraintsareshowninTable1.TheDNAprobessatisfyingtheseconstraintsareextractedfromeverycandidatesequenceandarepassedontothenextstep.8
Oligonucleotidefingerprintidentification
Table2.AlgorithmforcandidatesequenceselectionusingthesuffixtreeoutputVariables:
Procedure:
Input:Lmax,Lmin,T,gtandgrgt¼Targetgenome
gr¼Near-neighborgenomeofgt(1)Pcand¼emptysetLmin¼Minimumprobelength(2)LetE¼LmaxÀTLmax¼Maximumprobelength(3)LetM¼E+1¼LmaxÀT+1T¼Thespecificityconstraint,i.e.thecombinedminimum(4)Usingsuffixtree,identifyexactmatchesM1,...,MJoflengthnumberofmismatches,insertionsanddeletionsintheoptimalatleastMbetweengtandgr,andcandidatesC0,C1,...,CJfromgtalignmentbetweenaprobeandanynon-targetsequence,definingafingerprint(5)ForeachcandidatesequenceCjfromthesuffixtreeoutput,M¼Inputparameterofthesuffix-tree-basedalgorithmthatspecifies(A)IfCjisaprefixofgt,thenextendCjbyEbasestotheminimumlengthofexactmatchesbetweengtandgrtheright.GotostepD
(B)IfCjisasuffixofgt,thenextendCjbyEbasestoMj¼jthexactmatchbetweengtandgr,wherej¼1,...,J
Cj¼jthcandidatesequence,j¼0,1,...,J,definedasasubsequencetheleft.GotostepD
(C)ExtendCjontheleftandtherightbyEbasesofgtthatis:
boundedonbothsidesbyexactmatchesoflengthatleastM,or(D)AddallsubsequencesofCjsatisfyingthelengthlocatedbetweenthestartofgtandthefirstexactmatchoflengthatleastconstraintstoPcandM,orlocatedbetweenthelastexactmatchoflengthatleastResult:EverycandidateDNAprobeofgtsatisfyingconstraintsMandtheendofthegenomegtLmax,Lmin,andTwillbeinPcandE¼Lengthoftheextensionofcandidatesequencesintotheadjacentexactmatch(es)
hypothesistoinferhybridizationpatternsfrom‘optimal’BLASTalignments:IfthebestBLASTalignmentbetweenaDNAprobeandanon-targetgenomehasmorethanTmismatchesorgaps,thentheDNAprobewillbeconsideredasaninsilicoDNAfingerprint.ItiswelldocumentedthatusingBLASTforassessmentofcross-hybridizationofaprobewithnon-targetgenomeswillresultinsomenon-specificprobes(Rahmann,2003;Nordberg,2005).Ifawordoflengthwisnotfoundinadatabasesequence,theprobealignmentwiththesequencewillbeskippedresultinginpotentialmissedcross-hybridization.Inothersituations,partialalignmentswithprobesmayresultinunderestimatedcross-hybridization.Twopromisingapproachescouldbeconsideredtoimproveprobespe-cificity.First,additionalfilteringoftheprobesselectedasfinger-printsbyBLASTcouldbeperformedtoaugmentthehypothesisrelatingalignmenttohybridization.Inthiscase,additionalinforma-tionwouldbeextractedfromanalysesofthealignmentsreportedbyBLAST,suchasthemaximumnumberofcontiguousmatchesorthepositionofmatchesintheprobealignment,andusedasrulestoimprovespecificityconstraints.Second,betteralignmentalgo-rithmscouldbeimplementedasapost-processingstep,whichwouldincorporatehybridizationthermodynamicsintothealign-mentevaluationtotakeintoaccounthybridizationstability(Leberetal.,2005).However,itmustbeemphasizedthatthelackofanaccuratemodeltodirectlyrelateaDNAsequencealign-mentwithitscorrespondingDNA–DNAhybridizationleavesthechoiceofprobespecificitycharacterizationasanopenquestion.
TOFIparameterselection
ThethreestepsinTOFIimplementdifferentbioinformaticsalgo-rithms,eachcarryingoutadifferenttaskusingitsownsetofinputparameters.However,theminimumlengthoftheexactmatchesMinthefirststepisanalyticallyrelatedtothelengthconstraintsLminandLmaxontheDNAprobesinthesecondstepandthespecificitythresholdTusedinthethirdstep.Inthissection,wemathematicallyderivethisanalyticalrelationship.
TheproblemofselectinganappropriateMvaluecouldbestatedasfollows:giventhelengthconstraintsLminandLmaxonthelengthoftheprobesandthespecificitythresholdT,findarelationshipbetweenLmin,Lmax,TandM,whichguaranteesthatnovalidDNAfingerprintsarediscarded.
OurapproachisinitiatedbyextendingeachcandidatesequenceCj,j¼0,1,...,J,byEbasesintoeachsideoftheneighboringexactmatch.ThispreventsthepossiblediscardingofsignaturesthatincludetheboundariesofCj.FromtheextendedcandidatesequencesweconstructacandidateDNAprobesetPcand,whichcontainseverysequencesatisfyingthelengthconstraintsLminandLmax.OnlythoseDNAprobesinPcandthatsatisfytheexperimentalconstraintswillbeincludedintheprobesetPforalignmentwiththereferencedataset,i.e.PPcand.
WechooseEsuchthatM>E.Thisconditionguaranteesthattheoverlapsbetweentwoadjacentextendedcandidates,ifany,willbelimitedtotheexactmatchregionseparatingthetwocandidates.ItalsosetsalowerlimitonM,M¼E+1.ToensurethatanycandidateDNAprobeoflengthLihavinglessthanorequaltoLiÀTexactmatchesisnotdiscarded,theextensionlengthshouldbeE¼LiÀT.Thus,theextensionlengthEisconstrainedbyLminÀT E LmaxÀT.SubstitutingforM¼E+1andmakingaconservativeselection,weobtainM¼LmaxÀT+1.Suchselectionwill,mostlikely,generateacandidateprobesetPcandthatcontainssomefalsepositives,i.e.probesthatdonotsatisfythespecificityconstraints.Themajorityofsuchnon-specificprobeswillbediscardedaftertheBLASTalignmentinspection.However,somefalsepositiveswillremainduetopossiblemissedmatchesinBLAST,asdescribedintheprevioussection.ThedetailsofthecandidateprobeselectionalgorithmaregiveninTable2.
Finally,weprovethatifthecandidatesequencesareobtainedusingE¼LmaxÀT,withM¼E+1,thenallDNAfingerprintsisincludedinthesetPcand.However,asSdenotesthesetofDNAfingerprintsforgt,itwillsufficetoprovethatsuchselectionforMguaranteesthatSPcand.
9
W.Tembeetal.
Assertion1.Byconstruction,everysubsequenceofgtcontaininganexactmatchoflengthsmallerthanMisincludedinanextendedcandidatesequence(steps5.A–5.CinTable2).
Assertion2.Fromeachextendedcandidatesequence,everysub-sequencesatisfyingthelengthconstraintsLminandLmaxisincludedinPcand(step5.DinTable2).
Assertion3.Fromassertions1and2,noneofthesequencesinPcandcancontainanexactmatchoflengthMorgreater.
Assertion4.Bydefinition,aDNAfingerprintsi2ScontainsatmostLiÀTexactmatcheswhenalignedwithanynon-targetgenome.Thus,thelengthofthelongestexactmatchbetweensiandanyothernon-targetgenomeisLiÀT.Since,LiÀT LmaxÀT¼E Softwareimplementation WeusedMUMmer(Kurtzetal.,2004),anopensourcesoftwarethatimplementsasuffix-tree-basedalgorithmandprovidesseveraloptionsforcomparinggenomicsequences.ThemicroarrayDNAprobedesignfromcandidatesequenceswascarriedoutusingthecommercialsoftwareoligonucleotidemodelingplatform(OMP)(availableathttp://www.dnasoftware.com),whichimplementsastate-of-the-arthybridizationmodel(SantaLuciaandHicks,2004).TheBLASTNprogramfromNCBI_BLAST(version2.2.10)wasusedforaligningmorethan2.0millionnucleotidesequencesstoredinthe‘nt’nucleotidedatabaseattheNCBI.Thedatabasehasgrownsignificantlysinceweobtainedtheresultsdescribedinthispaperandwehavedownloadedthelatestversion,containingmorethan3.6millionsequences,forfuturerunsofTOFI.TheentiresoftwarepipelinewasinitiallyimplementedonaHPCenvironmentattheAdvancedBiomedicalComputingCenter(http://www-fbsc.ncifcrf.gov/)usingtheHighThroughputCom-putingsupportfromSGIÒonanAltixclusterconsistingof64·1.5GHzItanium2processorsrunningRedHatÒLinuxwith64GBofsharedmemory.Oncecandidatesequencesareobtained,TOFItakesadvantageoftheparallelprogrammingoppor-tunitiesonHPCresources.TheDNAprobedesignusingOMPhasbeenparallelizedusingOpenMPbyschedulingDNAprobedesignforeachcandidatesequenceonaseparateprocessor.TheexecutionofBLASTN,byfarthemostcomputationallyintensivepartofTOFI,isparallelizedbyassigningbatchesofDNAprobestosepa-rateprocessors.Inaddition,severalapplication-specificsoftwaremodulestoprocessDNAsequences,tocompileresultsoftheinter-mediatestagesforanalysis,andtoprocessoutputsofvariousstageswereimplemented.ThischoiceofresourcesandsoftwareisjustonewaytoimplementTOFI’sintegratedapproachshowninFigure1.Adifferentchoiceofsoftwareforsuffixtree,DNAprobedesignandsequencealignmentscouldbeusedaswell.How-ever,ourparticularchoicerepresents,arguably,someofthebesttoolsavailableforeachofthethreesteps. TOFIhassincebeenportedontooneoftheU.S.DepartmentofDefenseMajorSharedResourceCenter’sLinuxclusters,consistingof128dualprocessornodesonadistributedmemorysystem,wheredeploymentofmpiBLAST(Darlingetal.,2003)andexecutionofOMPonseparateprocessorsisbeingtested.Inthecurrentclusterimplementation,weusempiBLASTwith32processorsrunninginparallel,which,again,consumesthebulkofthecomputingtime.10 Fig.3.IdentificationofY.pestisDNAFingerprintsUsingTOFIona32-CPULinuxCluster. ThecomputationaltimeofthealgorithmdependsonthenumberofprobesgeneratedastheoutputofStep2andprovidedtompi-BLASTandonthesizeofthereferencedatabase.Thenumberofprobes,inturn,dependsonthelengthofthetargetgenome,theavailabilityandsimilarityofanear-neighborgenome,andtheselectedprobedesignconstraints.Thereferencedatabaseisseg-mentedaccordingtorulesofthumbsuggestedbythempiBLASTdevelopers(Darlingetal.,2003),wherethenumberofdatabasesegmentsissettothenumberofprocessors.Hence,thecomputa-tionaltimeofprocessingthereferencedatabaseisdirectlydepen-dentonthespeedupachievedbympiBLAST(Darlingetal.,2003).Theexecutiontimecouldbeimprovedbyusingotherparallelver-sionsofBLAST,suchaspioBLAST(Linetal.,2005). Casestudy:DNAfingerprintsforY.pestis TOFIwasusedtoidentifyDNAfingerprintsfortheplague-causingpathogenY.pestisstrainCO92(accessionno.NC_003143.1).Basedontheliterature(Chainetal.,2004),acloselyrelatedorganism,YersiniapseudotuberculosisstrainIP32953(accessionno.NC_006155.1),wasselectedasthenear-neighborgenome. TheempiricalthresholdT¼15wasselectedbasedonapriorianalysisofhybridizationpropertiesfortheselectedmicroarraytechnologyandexperimentalsetup,suchastheprobelengthandrequiredmelting-temperature.Thisparametercanbeadjustedonacase-by-casebasisusingfeedbackfromadditionalexperimentalevaluations.ForT¼15andmaximumprobelengthLmax¼40,accordingtostep3inTable2,theminimumlengthofexactmatchesinMUMmerisM¼26. Approximately96%oftheY.pestisgenomewasdiscardedusingMUMmerinthefirststep(Fig.3).Thustheideaofusinganear-neighborgenometoidentifyanddiscardexactmatchesprovedtobeextremelyeffectiveinthiscase.Outofabout4.6millionbasesofY.pestis,fewerthan200000bases,distributedunevenlyamong2222candidatesequences,wereconsideredfurther.Inthenextstep,slightlyover13600DNAprobessatisfyingtheexperimentalconstraintswereextractedfromthecandidatesequences. IntheBLASTprobespecificityevaluation,theseedsizew¼7wasselectedbecauseitwasthesmallestvalueavailableinthe Oligonucleotidefingerprintidentification BLASTversionthatweused,andalargeE-value¼100reducedthepossibilityofmissinghighscoringalignments. BasedonthespecifiedTOFIparameters,allbut146DNAprobeswererejected,definingtheinsilicoDNAfingerprintsthatwereselectedforexperimentalevaluationusingcustomDNAmicroar-rays.TheseDNAfingerprintsunderwentfurtherscreeningbasedonadditionalexperimentalconstraints,suchasthepresenceofrestrictionenzymecleavagesites,leavingonly99insilicofinger-printsfortesting. Experimentalevaluationofinsilicofingerprints TencustomizedDNAmicroarraychips,eachcontainingseveralreplicatesofthe99insilicoDNAfingerprintsandanumberofcontrolsequences,werefabricatedandusedforevaluationpur-poses.SixchipswerehybridizedwiththetargetgenomeY.pestisandfourchipswereusedtotestcross-hybridizationwiththenear-neighborgenomeY.pseudotuberculosis.Normalizeddatawereusedtocomparehybridizationsignals. Themicroarrayhybridizationdatawereusedtoanalyzethedis-criminatingpoweroftheinsilicofingerprintsbycomparingtheexperimentalhybridizationresultsoftheprobeswithY.pestisandY.pseudotuberculosis.Figure4illustratesasamplesetofdatashow-ingthenormalizedresponse(y-axis)asafunctionoftheDNAfingerprints,whicharearrangedindescendingorderofthediffer-encebetweentheirresponseswithY.pestisandY.pseudotuberculo-sis.Variabilityinthehybridizationresponsesinrepeatedexperimentsispresentedbystandarderrorbarsforeachprobe.Outofthe99DNAfingerprintstested,20(datanotshowninFig.4)producedhigheraverageresponseforY.pseudotuberculosisthanthatforthetarget.Thisisduetocomputationalandexperi-mentalreasons.ThecomputationalreasonsrelatetolimitationsofusingBLASTforspecificityevaluation,asdiscussedinSection3,step3.Adetailedpost-experimentalanalysisoftheBLASToutputsindicatesthat12outofthe20probesdonothavereportedalign-mentswithY.pseudotuberculosisinthesignificanthitlist.Fortheremainingeightprobes,contiguousmatchesof20basesormorewereobservedintheBLASTalignmentsbutthecalculatedsumofmismatchesandgapswaslargerthan15,causingtheseprobestobeidentifiedasfingerprints.ThistypeofproblemcouldbeavoidedifathresholdforthemaximumnumberofcontiguousmatchescouldbeexperimentallydeterminedandusedforadditionalfilteringoftheprobesthatpassedthefirstBLASTspecificitytesting.Theexperimentalreasonsrelatetothevariabilityofproberesponses.Althoughallofthe20probeshavelargermeanresponsesforY.pseudotuberculosisthanthatforY.pestis,onlysixoftheseprobeshavesignificantlylargerresponses.Theexperimentalreasonfortheobservedaberranthybridizationofthesesixprobesisnotclear.These20probeswereexcludedfromfurtherevaluation. Forafingerprinttobeusefulinadiagnosticassay,itshouldyieldaverylowresponsefornon-targetsandahighresponseforthetarget.Thus,afewDNAfingerprintsinFigure4thathaveagooddiscriminatorypowerbuthavearelativelyhighresponsefornon-targetswouldnotbeconsideredusefulondiagnosticassays.ThedatausedinFigure4canalsobeusedtoidentifyvalidfingerprintsbasedonalternaterules,suchasidentifyingquantifiablethresholdvaluesfortargetandnon-targetresponses.Forexample,25probescouldbeselectedbyusingaminimumthresholdvalueof2.0forY.pestisresponsesandamaximumthresholdvalueof1.0forY.pseudotuberculosisresponses,while20probescouldbeselected Fig.4.ComparisonofHybridizationofinsilicoFingerprintswithTarget(Y.pestis)andNon-target(Y.pseudotuberculosis). usingaminimumthresholdof2.0forY.pestisandallowingamaxi-mumthresholdof0.5forY.pseudotuberculosis.Ineachcase,asufficientlylargenumberofprobeswouldallowfordetectionredundancy. OtherapplicationsofTOFI Havinganear-neighborgenomeisnotarequirementforTOFI.Thenear-neighborgenomeisusedtoreducethesolutionsearchspaceasmuchaspossibleinthefirststep,whichiscomputationallytheleastexpensivestep.Thetargetgenomecouldbecomparedwithanysmallsetofgenomesusingsuffixtrees.Thehigherthenumberofmatchesidentifiedinthefirststepis,theloweristhenumberofcomputationsrequiredinthesubsequentsteps.Inthecurrentstudy,asinglenear-neighborcomparisonreducedthesearchspaceveryeffectively.Inthecaseinwhichacloselyrelatednear-neighborforthetargetisunknown,eitherarbitrarygenome(s)couldbeusedasnear-neighbor(s)orthefirststepcouldbeomitted.Infact,TOFIwassuccessfullyusedtoidentifyDNAfingerprintsforplasmidspPCP1,pCD1andpMT1inY.pestiswithoutusinganynear-neighbor.Becauseplasmidsaremuchshorter(aboutafewthousandbases)thanbacterialgenomes(typicallyoverafewmillionbases),thewholeplasmidcouldbeconsideredasasinglecandidatesequenceandsentdirectlyasinputtotheDNAprobedesignstep. TOFIwasalsousedtoidentifyfingerprintsofFrancisellatularensisstrainSCHUS4(accessionno.NC_006570.1).Thegenomicsequenceofthenear-neighborFrancisellaphilomiragiawasnotavailableand,therefore,thefirststepofTOFIwasomitted.Inthesecondstep,weusedOMPtoscanthewholeF.tularensisgenome,consistingofabout1.9millionbases.Overlapofadjacentprobeswaslimitedto10basestoreducecomputationtime.OMPidentifiedabout20000probes,whichweretestedforspecificitywithT¼15,resultingin250fingerprints.Furtherscreeningforrestrictionenzymecleavagesitesreducedthenumberofinsilicofingerprintsto121. Fourchipswerefabricatedusingseveralreplicasofthe121insilicofingerprintsandanumberofcontrolprobes.TwochipswereusedtotesthybridizationwithF.tularensisandtheothertwototestcross-hybridizationwithF.philomiragia.Weperformed 11 W.Tembeetal. initialevaluationusingacriterionsimilartotheoneemployedforY.pestis.IncontrasttotheY.pestishybridizationexperiments,onlyoneprobehadhigheraverageresponsewithF.philomiragiathanwithF.tularensis.Intheexperiment,85probesshowedanormalizedresponsewithF.philomiragiasmallerthan1.0,while81ofthoseprobeshadresponseslargerthan2.0withF.tularensis.Currently,alargenumberofadditionalexperiments,includingastandardpanelofnon-targetgenomes,arebeingperformedtoevalu-atethefingerprintsofY.pestisandF.tularensisbeforetheyareusedasprobesindiagnosticassays. Currentlimitationsandplansforimprovements TOFIhasalreadybeenusedinitscurrentconfigurationtoidentifyfingerprintsforanumberofpathogens.However,severalalgo-rithmicandimplementationissuesaffectitsperformanceandarebeingaddressed. ThescopeofDNAfingerprintidentificationinTOFIiscurrentlylimitedtoasingletargetsequence.Weareinvestigatingapproachestoselectfingerprintscommontoalargenumberofrelatedtargets,whichwouldallowfortheidentificationoffingerprintscommontospecificspeciesorgenus.ForhighlyvariableRNAviruses,uniquefingerprintsmaynotexist.Forthisapplication,anapproachbasedontheselectionofnon-uniqueprobes,whichtogethermayformuniquehybridizationpatternsonachipforunambiguousviralidentification,isalsobeingconsidered. AlthoughwehaveprovidedadefinitionofaDNAfingerprintandanalgorithmthatguaranteesallfingerprintssatisfyingitareiden-tified,validprobescouldpotentiallybediscardedduetosignificantoverlapwithadjacentprobesduringtheprobedesignphase(Step2ofTOFI).Thisrelatestopracticalconsiderationsinordertoreducethenumberof‘similar’probesonthechip,allowspaceformultiplereplicates,andlimitthetotalnumberofprobes,consideringthatseveralcontrolsequencesneedtobepresentonthemicroarray.ExperimentalevaluationsoftheidentifiedinsilicofingerprintsforY.pestisandF.tularensisindicatethepossibilityforimprovementinthealgorithmspecificity.Itwasfoundthatcross-hybridizationwithnon-targetgenomeswasnotdetectedbyBLASTinabout10%oftheinsilicofingerprintsforY.pestis,whileinanadditional10%ofthefingerprintsthecross-hybridizationwasunderestimated.Specificitywillbeimprovedinthefuturebasedon:(1)theselectionofoptimalTOFIparametersusingcomprehensiveevaluationoftheexperimentalresults;(2)thepost-processingofBLASTalign-mentsusingexpertrulestobettercorrelatealignmentswithhybrid-izationand(3)thedevelopmentofoptimalalignmentalgorithmsthatincludehybridizationthermodynamicsasapost-processingstepaftertheBLASTspecificityevaluation. Duetotheextremelyrapidgrowth/modificationsinavailableDNAsequences,thecontinuedvalidityoftheDNAfingerprintsmustbefrequentlyverified.Effortstoautomaticallyupdatefinger-printsarealsoplanned. definitionofaDNAfingerprintisprovided.Moreimportantly,giventhedesiredlengthofafingerprintanditsrequirednumberofnon-matchingbasepairs,weprovideanalgorithmthatguaran-teesthatallinsilicofingerprintsareidentified.Fingerprintsforanumberofpathogenicsequenceshavebeenpreliminarilyevaluatedthroughexperimentaltestswithpathogensofinterestandnon-targetgenomes.InitialresultsindicatethattheapproachiscapableofidentifyingmultiplefingerprintsforspecificDNAsequences,andthatthealgorithmcouldbeimprovedtoenhancespecificity.Furthertesting,withastandardpanelofnon-targetgenomes,isunderway.ThisinformationwillenableoptimalTOFIparameterselectionandwillserveasavaluablebenchmarkforfuturealgorithmimprovements. DISCLAIMER TheopinionsorassertionscontainedhereinaretheprivateviewsoftheauthorsandarenottobeconstruedasofficialorasreflectingtheviewsoftheU.S.ArmyortheU.S.DepartmentofDefense. ACKNOWLEDGEMENTS Theauthorswishtoexpresstheirgratitudetotheanonymousrefereesforusefulcommentsandsuggestionsaswellasinterestingideasforfuturework.TheauthorsthankKamalKumaroftheBiotechnologyHPCSoftwareApplicationsInstituteforhelpindevelopingTOFI’sgraphicaluserinterfaceandBobStephens,JackCollinsandKarolMiaskiewiczoftheAdvancedBiomedicalComputingCenter,NationalCancerInstitute,Frederick,MD,forthecomputationalsupport.ThisworkwassponsoredbytheU.S.DepartmentofDefenseHigh-PerformanceComputingModernizationProgram(HPCMP),undertheHigh-PerformanceComputingSoftwareApplicationsInstitutes(HSAI)initiative,andtheU.S.DefenseThreatReductionAgency.ConflictofInterest:nonedeclared. REFERENCES Abee,T.etal.(2004)Impactofgenomicsonmicrobialfoodsafety.TrendsBiotechnol., 22,653–660. Altschul,S.F.etal.(1990)Basiclocalalignmentsearchtool.J.Mol.Biol.,215, 403–410. Benson,D.A.etal.(2005)GenBank.NucleicAcidsRes.,34,D16–D20. Chain,P.S.etal.(2004)InsightsintotheevolutionofYersiniapestisthroughwhole-genomecomparisonwithYersiniapseudotuberculosis.Proc.NatlAcad.Sci.USA,101,13826–13831. Darling,A.etal.(2003)Thedesign,implementation,andevaluationofmpiBLAST. In4thInternationalConferenceonLinuxClusters:TheHPCRevolution2003,inconjunctionwiththeClusterWorldConference&Expo,SanJose,CA. Draghici,S.etal.(2005)Identificationofgenomicsignaturesforthedesignofassaysfor thedetectionandmonitoringofanthraxthreats.InAltman,R.B.etal.(ed.),Proceed-ingsofthePacificSymposiumofBiocomputing2005Hawaii,USA,pp.248–259.Gardner,S.etal.(2004)Sequencingneedsforviraldiagnostics.J.Clin.Microbiol.,42, 5472–5476. Gordon,P.M.K.andSensen,C.W.(2004)Osprey:acomprehensivetoolemploying novelmethodsforthedesignofoligonucleotidesforDNAsequencingandmicroar-rays.NucleicAcidsRes.,32,e133. Gusfield,D.(1997)AlgorithmsonStrings,Trees,andSequences:ComputerScience andComputationalBiology.CambridgeUniversityPress,Cambridge,UK. Hardiman,G.(ed.)(2003)MicroarrayMethodsandApplications-NutsandBolts.DNA Press,Eagleville,PA,USA. Hass,S.A.etal.(2003)Genome-scaledesignofPCRprimersandlongoligomersfor DNAmicroarrays.NucleicAcidsRes.,31,5576–5581. Ivnitski,D.etal.(2003)Nucleicacidapproachesfordetectionandidentificationof biologicalwarfareandinfectiousdiseaseagents.Biotechniques,35,862–869. CONCLUSIONS ThisworkpresentedTOFI,anintegratedbioinformaticstooltoidentifyinsilicogenomicfingerprintsforthedesignofmicroarraydiagnosticassays.TOFIisastandaloneapplicationthatexploitstheparallelprogrammingbenefitsprovidedbyHPCplatformsandallowsuserstoselectinputparametersthroughagraphicaluserinterface.Thisworkdiffersfrompreviousonesinthataformal12 Oligonucleotidefingerprintidentification Joos,T.andFortina,P.(2005)Microarraysinclinicaldiagnosis.HumanaPress, Totowa,NJ,USA. Kaderali,L.andSchliep,A.(2002)Selectingsignatureoligonucleotidestoidentify organismsusingDNAarrays.Bioinformatics,18,1340–1349. Kurtz,S.(2002)constructionandapplicationofvirtualsuffixtrees..PhDdissertation, ¨en,UniversitatBielefeld,Bielefeld,Germany.TechnischeFakulto Kurtz,S.etal.(2004)Versatileandopensoftwareforcomparinglargegenomes.BMC GenomeBiol.,5,R12. Leber,M.etal.(2005)AfractionalprogrammingapproachtoefficientDNAmelting temperaturecalculation.Bioinformatics,21,2375–2382. Lin,H.etal.(2005)EfficientdataaccessforparallelBLAST.IEEEInternational ParallelandDistributedProcessingSymposium,Denver,CO. Nordberg,E.(2005)YODA:selectingsignatureoligonucleotides.Bioinformatics,21, 1365–1370. Panjkovich,A.andMelo,F.(2005)Comparisonofdifferentmeltingtemperature calculationmethodsforshortDNAsequences.Bioinformatics,21,711–722.Pruitt,K.D.etal.(2005)NCBIReferenceSequence(RefSeq):acuratednon-redundant sequencedatabaseofgenomes,transcriptsandproteins.NucleicAcidsRes.,33,D501–D504. Rahmann,S.(2003)Fastlargescaleoligonucleotideselectionusingthelongestcom-monfactorapproach.J.Bioinfo.Compu.Biol.,1,343–361.SantaLucia,J.,JrandHicks,D.(2004)ThethermodynamicsofDNAstructuralmotifs. Annu.Rev.Biophys.Biomol.Struct.,33,415–440. Schliep,A.etal.(2003)GrouptestingwithDNAchips:generatingdesignsanddecod-ingexperiments.InProceedingsoftheComputationalSystemsBioinformatics,August11-14,Stanford,CA,pp.84–91. Sergeev,N.etal.(2006)MicroarrayanalysisofBacilluscereusgroupvirulencefactors. J.Microbiol.Meth.,65,488–502. Slezak,T.etal.(2003)Comparativegenomicstoolsappliedtobioterrorismdefense. Brief.Bioinform.,4,133–149. Urisman,A.etal.(2005)E-Predict:acomputationalstrategyforspeciesidentification basedonobservedDNAmicroarrayhybridizationpatterns.BMCGenomeBiol.,6,R78. Viljoen,G.J.etal.(eds)(2005)MolecularDiagnosticsPCRHandbook.Springer Publishers,Berlin,Germany. Wang,D.etal.(2002)Microarray-baseddetectionandgenotypingofviralpathogens. Proc.NatlAcad.Sci.USA,99,15687–15692. Weiner,P.(1973)Linearpatternmatchingalgorithms.InProceedingsof14thIEEE AnnualSymposiumonSwitchingandAutomataTheory,Washington,DC,IEEEComputerSoc.,pp.1–11. Willse,A.etal.(2004)Quantitativeoligonucleotidemicroarrayfingerprintingof Salmonellaentericaisolates.NucleicAcidsRes.,32,1848–1856. 13 因篇幅问题不能全部显示,请点此查看更多更全内容