您的当前位置:首页正文

BIOINFORMATICS ORIGINAL PAPER Sequence analysis

2020-07-20 来源:九壹网
BIOINFORMATICS

Sequenceanalysis

ORIGINALPAPER

Vol.23no.12007,pages5–13doi:10.1093/bioinformatics/btl549

Oligonucleotidefingerprintidentificationformicroarray-basedpathogendiagnosticassays

WaibhavTembe,NelaZavaljevski,ElizabethBode1,CatherineChase1,JeanneGeyer1,LeonardWasieloski1,GaryBenson2andJaquesReifmanÃBiotechnologyHPCSoftwareApplicationsInstitute,TelemedicineandAdvancedTechnologyResearchCenter,USArmyMedicalResearchandMaterielCommand,Ft.Detrick,MD,1DiagnosticSystemsDivision,USArmyMedicalResearchInstituteofInfectiousDiseases,Ft.Detrick,MDand2DepartmentsofBiologyandComputerScience,BostonUniversity,Boston,MA,USA

ReceivedonJune15,2006;revisedonOctober18,2006;acceptedonOctober21,2006AdvanceAccesspublicationOctober26,2006AssociateEditor:JohnQuackenbush

ABSTRACT

Motivation:AdvancesinDNAmicroarraytechnologyandcomputa-tionalmethodshaveunlockednewopportunitiestoidentify‘DNAfin-gerprints’,i.e.oligonucleotidesequencesthatuniquelyidentifyaspecificgenome.Wepresentanintegratedapproachforthecomputa-tionalidentificationofDNAfingerprintsfordesignofmicroarray-basedpathogendiagnosticassays.WeprovideaquantifiabledefinitionofaDNAfingerprintstatedbothfromacomputationalaswellasanexperimentalpointofview,andtheanalyticalproofthatallinsilicofingerprintssatisfyingthestateddefinitionarefoundusingourapproach.Results:Thepresentedcomputationalapproachisimplementedinanintegratedhigh-performancecomputing(HPC)softwaretoolforoligonucleotidefingerprintidentificationtermedTOFI.WeemployedTOFItoidentifyinsilicoDNAfingerprintsforseveralbacteriaandplas-midsequences,whichwerethenexperimentallyevaluatedaspotentialprobesformicroarray-baseddiagnosticassays.Resultsandanalysisofapproximately150insilicoDNAfingerprintsforYersiniapestisand250fingerprintsforFrancisellatularensisarepresented.

Availability:Theimplementedalgorithmisavailableuponrequest.Contact:jaques.reifman@us.army.mil.

INTRODUCTION

Therecentadvancesingenomicsequencingandtheavailabilityoflarge-scalesequencedatabaseshaveunlockedseveralopportunitiestoidentify‘genomicsignatures’or‘DNAfingerprints’,i.e.shortDNAsequencesthatuniquelyascertainthepresenceorabsenceofcausativebiologicalagents,suchasviruses,bacteriaorvirulentgenes.Forexample,avastnumberofDNA-baseddetectionanddiagnostictechnologiesarebeingdevelopedtoquicklyidentifybiologicalthreatagents(Ivnitskietal.,2003;Slezaketal.,2003;Draghicietal.,2005;KaderaliandSchliep,2002),suchastheanthrax-causingbacterium,Bacillusanthracis,andtheplague-causingbacterium,Yersiniapestis.DNAsignaturescouldalsobeusedtodetectthepresenceofoneormorevirulentgenes,suchasBacillusgenes,whichencodeimportantvirulencefactors,entereotoxinsandexotoxins(Sergeevetal.,2006),andtoprovide

ÃTowhomcorrespondenceshouldbeaddressed.

high-resolutiondifferentiationbetweencloselyrelatedmicroorgan-ismsinmicrobialforensics(Willseetal.,2004).Newvirusesandstrainshavebeenidentifiedusingaspecialmicroarraytechnologyconsistingofapproximately1100070meroligonucleotides(Wangetal.,2002).DNAfingerprintshavealsobeenusedtodevelopdiagnosticassaysforawide-rangeofimportantapplicationsinmedicine,environmentalmonitoringandqualitycontroloffoodproducts(Hardiman,2003;JoosandFortina,2005;Wangetal.,2002;Abbeetal.,2004).

ThespecificalgorithmimplementedinaDNAfingerprintidenti-ficationmethodisselectedbasedon(1)whethertheDNAfinger-printsarebeingsoughtforaspecificpathogenstrain(e.g.Y.pestisCO92),agroupofpathogensfromthesamespecies(e.g.allY.pestisstrains)orgenus(e.g.allYersiniaspecies),orasetoforganismsthatmayormaynothaveanyphylogeneticrelationship(e.g.todetectaviralorabacterialfamily)and(2)theexperimentalconditionsspecifiedbytheendapplicationtechnology,suchasPCR(Slezaketal.,2003;Viljoenetal.,2005;Haasetal.,2003;GordonandSensen,2004)orDNAmicroarrays(KaderaliandSchliep,2002;Hardiman,2003;Rahmann,2003;Leberetal.,2005;Nordberg,2005).Theuseofreal-timePCR-baseddetectiontechnologyrequirestheidentificationofthreeinformativesequences:twoamplificationprimersequencesandanadditionalprobesequence(thefinger-print).Theassayrequiresthatprimerhybridizationtakesplacenearthefingerprintand,therefore,imposesconstraintsontheposi-tionoftheprimerandPCR-basedfingerprints.Moreover,PCR-basedassaysarequitelimitedintheirmultiplexingcapabilities,asdifferentassaysarerequiredtodetectdifferentpathogenicsequences.Incontrast,microarraysdonotimposeanypositionspecificconstraintsontheDNAfingerprints,andseveralfinger-printscanbesimultaneouslyplacedonamicroarraytoprovidedetectionredundancyandallowforthediagnosisofmultiplepatho-gensonasingleassay.Despitetheseadvantages,microarray-basedassaysarerelativelyinsensitiveandslowcomparedtotheexquisitesensitivityandspeedofPCR-basedassays.Microarraysensitivitycanbegreatlyenhancedbyincorporatingsampleamplificationpriortohybridizationbut,unfortunately,thisresultsinanetincreaseinassaytimeforalreadyslowassays.

ThispaperisconcernedwiththeidentificationofDNAfinger-printsforspecific,singlepathogenicsequences,referredtoasthe

ÓTheAuthor2006.PublishedbyOxfordUniversityPress.Allrightsreserved.ForPermissions,pleaseemail:journals.permissions@oxfordjournals.org

5

W.Tembeetal.

target,forthedesignofDNAmicroarray-baseddiagnosticassays.Thetargetcouldbeanentiregenome(e.g.B.anthracisAmes),achromosome(e.g.Brucellamelitensisbiovarabortus2308chromo-someII)oranon-chromosomesequence(e.g.B.anthracisplasmidpXO2).Amoregeneralproblem,notaddressedhere,involvestheidentificationofDNAfingerprintscommontomultiplestrainsormultiplespecies.AneffectiveapproachforthisproblemistousemultiplegenomealignmentandsearchforconservedregionstoidentifycommonDNAsignatures(Slezaketal.,2003).Identifica-tionofcommonDNAsignaturesbecomesanevengreaterproblemforhighlyvariableRNAviruses(Gardneretal.,2004),whereapromisingsolutionistoselectcombinationsofnon-uniqueprobesanduseuniquehybridizationpatternstounambiguouslyidentifyspecificviralstrains(Urismanetal.,2005;Schliepetal.,2003).Giventhelonglengthofmosttargets,theidentificationofDNAfingerprintsisaproblemofhighcomputationalcomplexity.Thepotentialsolutionspaceisextremelylargebecauseeverysubse-quenceofthetargetsequenceneedstobeconsidered.Furthermore,thedeterminationofuniquenessrequirescomparisonwithnucleot-idedatabases,suchastheGenBank(Bensonetal.,2005),thataregrowingexponentiallyinsize.Moreover,uniquenessofDNAfin-gerprintsobtainedusingsuchcomparativealgorithmsisonlyvalidwithrespecttothereferencedatabaseused.Asnewsequencesaremadeavailable,previouslyidentifiedfingerprintsneedtoberevalidated.

SeveralpracticalchallengesariseintheexperimentalevaluationofthecomputationallyidentifiedDNAfingerprintsformicroarray-basedassays.TheDNAfingerprintsshouldproduceahighresponsewhenhybridizedwithasamplecontainingthetargetgenome.Con-versely,theresponseforanynon-targetgenomeshouldbeaslowaspossible.Thus,algorithmsforDNAfingerprintidentificationmustincludeexperimentalconstraintsandDNA–DNAhybridiza-tionmodelingmethodstopredicttheresponseonamicroarray.AlthoughresearchinmodelingmolecularlevelinteractionsbetweenDNAsequenceshasmadesignificantprogress,thereis,unfortunately,noanalyticalmethodavailabletodaythatcanpredicttheexactoutcomeofahybridizationreactionbetweentwoormorearbitraryDNAsequences(SantaLuciaandHicks,2004;Nordberg,2005).Moreover,duetothevariabilityintheoutcomeofahybrid-izationexperiment,alargenumberofrepetitionsarerequiredtoexperimentallyevaluatetheDNAfingerprints.Itmightnotbepos-sibletoexperimentallytestallthecomputationallyidentifiedfin-gerprintsbecauseoftheassociatedcostsandlimitedresources.Tosimultaneouslyaccommodatethesecomputationalneedsandexperimentalconstraints,DNAfingerprintidentificationtools(KaderaliandSchliep,2002;Rahmann,2003;Leberetal.,2005)integratecomputationalalgorithmsforidentifyinguniquesequencesandDNAhybridizationmodelingtoolsforpredictingtheoutcomeofthemicroarrayexperiment.Often,thesetoolsapplyvariousapproximationstoreducethecomputationalcomplex-ity.Forexample,theintegratedapproachofKaderaliandSchliep(2002)usesanefficientsearchalgorithmbasedonsuffixtreesandasimplifiedtwo-statetransitionnear-neighborthermodynamicmodelforDNAprobedesignandcross-hybridization.Thesimpli-fiedmodelreducesthecomputationaltimebutintroducesmodelingerrorsintheDNAfingerprintdesign.Asimilarthermodynamicmodelwithacomputationallymoreefficientapproachwaspro-posedbyRahmann(2003).Anefficientfractionalprogrammingapproachformelting-temperaturecomputationwithanimproved6

two-statetransitionnear-neighborthermodynamicmodelhasalsobeenproposed(Leberetal.,2005),however,computationaltimewouldstillbeanissueforcross-hybridizationevaluationofalargenumberofnon-targetgenomes.

ComputationalandexperimentalfactorsmakequantificationoftheuniquenessorspecificityofashortDNAsequencechallenging.Infact,aliteraturesurveyindicatesthattherelatedstudieshavenotstatedaprecise,quantitativedefinitionofaDNAfingerprint.Althoughthegeneralideaistosearchthetargetgenomefor‘unique’DNAsequencesandthentestthemexperimentally,theinsilicocriterionforuniquenesshasnotbeenexplicitlystated.Inthispaper,wefirstprovideaformaldefinitionofaDNAfingerprintbasedonvariousexperimentalconditionsandaspecificitycriterion.Wethendescribeanintegratedapproachthatcombinesefficientbioinformaticsalgorithms,takesintoaccountexperimentalcon-straints,andincludesalarge-scalecomparisonofDNAfingerprintswithnucleotidedatabases.Next,wedescribethealgorithmunder-lyingTOFI(toolforoligonucleotidefingerprintidentification),itssoftwareimplementationonahigh-performancecomputing(HPC)platform,andananalyticalapproachtochoosetheinputparametersofTOFI,whichguaranteesthatallpossibleDNAfingerprintssat-isfyingthestateddefinitionareobtained.Finally,wediscussinitialexperimentalresults,whichhelpevaluateourdefinitionofaDNAfingerprintandtheassociatedspecificitycriterion.

TERMINOLOGYANDPROBLEMDEFINITION

ADNAfingerprintforagiventargetgenomegtisdefinedwithrespecttoareferencenucleotidesequencedatabasedenotedbyG¼{g1,g2,...,gn,...,gN}thatcontainsNsequences.Inpractice,GconsistsofDNAsequencesfromoneormorepubliclyavailablecomprehensivedatabases,suchasGenBank,oranyothersmallernucleotidedatabase,suchasaviralDNAsequencedatabase.ThetargetgenomemayormaynotbelongtoG,implyingthatitcouldbeaknownpathogenoranewlysequencedone.ImplicitinthedefinitionofaDNAfingerprintisitsvaliditywithrespecttotheavailabledatabase.AsnewerDNAsequencesbecomeavailableandareaddedtoG,itisnecessarytoverifythevalidityofthepreviouslyidentifiedfingerprints.

Basedontheschoolofthought,computationalorbiological,thedefinitionofaDNAfingerprintvaries.Therefore,somediscussionaboutournotionofaDNAfingerprintisinorder.

Fromapurecomputersciencestandpoint,aDNAfingerprintofgtcouldbedefinedas‘anysubsequenceofgtthatisnotasubse-quenceofanygn2G,n¼t’.Bythisdefinition,theproblemofidentifyingDNAfingerprintsisequivalenttotheclassicstringcomparisonproblemofidentifyingsubstringsofgtthatdonotexactlymatchanysubstringofanygn2G,n¼t.Althoughmathe-maticallycorrect,thisdefinitionlackstheapplication-specificrequirements.DNAfingerprintshavetosatisfy:(1)designcon-straints,sothattheycanbeusedasDNAprobesonmicroarraysand(2)specificityconstraints,sothattheycandiscriminate,inamicroarrayhybridizationreaction,betweentargetandnon-targetsequences.DNAfingerprintsthatsimultaneouslysatisfybothdesignandspecificityconstraintsrequireabiologicallymoresounddefinition.Wemathematicallyformalizetheexperimentalandspecificityconstraintsasfollows.

LetKdenotetheDNAmicroarrayexperimentalconstraints,suchastheminimumandmaximumlengthoftheDNAfingerprint,the

Oligonucleotidefingerprintidentification

hybridizationmelting-temperature,GCcontent,etc.(SantaLuciaandHicks,2004),andletP¼{p1,p2,...,pi,...,pI}denotethesetofallsubsequencesofgtthatsatisfyK.Thus,bydefinition,everypi2Pwillhavelengthwithinthespecifiedminimum(Lmin)andmaximum(Lmax)bounds,GCcontentwithintherequiredrange,andwillsatisfyseveralotherpropertiesspecifiedbythechosenDNAhybridizationmodelingmethodology.WerefertothesequencesinPasDNAprobes.NotethatconstraintsdenotedbyKdonotspecifyattributesregardingtheuniquenessofaDNAprobewithrespecttonon-targetsequences.

QuantifyingspecificityofDNAfingerprintsfromanexperimentalpointofviewisverysubjective.ItisbasedoninterpretingtheexperimentalhybridizationresultsbetweenDNAprobesandnon-targetDNAsequences.Sincethereisalackofaccurateinsilicohybridizationmodels,weinferthespecificityofaDNAprobebyfirstcomputingDNAsequencealignments,andthendeterminingifthealignedprobemeetsanempiricalthresholdT.ThisisbasedonthehypothesisthatDNAsequencesthatalignpoorlyareunlikelytoformastableDNA–DNAduplexforagivensetofexperimentalconstraints.ThishypothesisimpliesthattheDNAsequencealignment,calculatedstrictlyusingcom-putationaltools,providesquantification,throughthethresholdT,oftheactualstrengthoftheDNA–DNAduplex.Thus,wecomputethespecificityofaDNAprobefromthenumberofmismatches,gapsandinsertions/deletionsinthealignmentandcompareitwiththethresholdT,representingthesetUofallspecificityconstraints.Havingformalizedtheexperimentalandspecificityconstraints,wedefineaDNAfingerprintandtheproblemofidentifyingallDNAfingerprintsforatargetgenomeasfollows:

Definition(DNAfingerprint):ADNAprobepioflengthLiisconsideredaDNAfingerprintofgtifandonlyifanoptimalsequencealignmentbetweenpiandanyothersequencegn2G,n¼t,hasatmostLi-Tmatches.

Definition(DNAfingerprintidentification):ForatargetDNAsequencegt,findallinsilicoDNAfingerprintsthatsatisfytheexperimentalconstraintsKandspecificityconstraintsUwithrespecttoareferenceDNAsequencedatabaseG.

LetS¼{s1,s2,...,sf,...,sF}beasubsetofP,i.e.S󰀁P,thatdenotesthesetofDNAprobesthatsatisfybothconstraintsKandU.OurgoalistofindallFelementsofS.WerefertotheelementsofSasinsilicoDNAfingerprintsbecausetheysatisfyallconstraintsthathavebeenquantifiedforcomputationalpurposes.Theirexperi-mentalvalidityneedstobetestedinanactualDNAmicroarrayexperiment.Unlessstatedotherwise,henceforththeterm‘DNAfingerprint’impliesinsilicoDNAfingerprint,whichisvalidwithrespecttoareferencedatabaseused.

Fig.1.TheThreeStepsoftheTOFIAlgorithm.

database.Auser-definedspecificityconstraintUisusedtointerpretthealignmentsfromthestandpointofcross-hybridizationbetweentheDNAprobesandnon-targetgenomes.DNAprobessatisfyingUarereportedasDNAfingerprintsandaretestedonmicroarrays.

Althoughtheproblemhasbeensplitintothreediscretestepsforclarityofexplanation,theindividualstepsarenotcompletelyinde-pendentfromanalgorithmicstandpoint.Infact,ourobjective,toobtainallDNAfingerprints,leadsustoconcludethattheinputparametersinthefirststephaveananalyticalrelationshipwiththeconstraintsimposedinthesecondandthethirdsteps.Wediscusstheinterdependenceofthethreestepsafteranin-depthdescriptionofeachofthethreesteps.

Step1:solutionspacereduction

Thesolutionspacetobesearchedisextremelylargebecauseeverysubsequenceofthegiventargetmustbeconsidered.Testingeachsubsequenceexperimentallyisimpracticalandexpensive.Butreducingthesolutionspacecomputationallycanbequickandcheaper.Forthispurpose,weexploitthesequencesimilaritiesbetweenthetargetgenomeandanevolutionarynear-neighbor(gr)thatcanbeidentifiedfromaphylogenetictreeorpublisheddata.ThetargetandneighborwillcontaincommonDNAsequences,which,obviously,cannotbeusedasDNAfingerprints.DNAsequencescommontobothgtandgrareextractedusingsuffixtrees(Wiener,1973;Gusfield,1997),which,withinthedomainofcomparativegenomics,havebeenusedinVmatch(Kurtz,2002)andtheMaximalUniqueMatcher(MUMmer)(Kurtzetal.,2004)toidentifyrepeats,exactorapproximatematches,andsinglenucleotidepolymorphisms.Detailsoftheconstruction,traversalandnumerousapplicationsofsuffixtreesinseveraldif-ferentstring-matchingapplicationsareavailablein(Gusfield,1997).Itshouldbenotedthatthesolutionspacecouldbefurtherreducedbycomparingthetargetgenomewithmultiplenon-targetgenomes,asdescribedbySlezakandcolleagues(Slezaketal.,2003).ThiscouldbedonewithinTOFI’scurrentalgorithmiccon-figurationbyconcatenatingstringsfrommultiplenon-targetgen-omesintoonelongstringandprovidingitasthenear-neighborgenomeforcomparison.Analternate,perhapsmoreefficient

7

INTEGRATEDAPPROACH

TOFIimplementsamulti-stepapproachthatbreaksdowntheprob-lemoffingerprintidentificationintothethreestepsillustratedinFigure1.ThefirststepreducesthesolutionspacebydiscardingDNAsequencescommontoboththetargetsequenceandoneormorebiologicalnear-neighborsequences.Thesurvivingsequencesaretermedcandidatesequences.Inthesecondstep,amicroarrayDNAprobedesignphaseextractsfromcandidatesequencesonlythosesubsequencesthatsatisfytheapplication-specificexperi-mentalconstraintsK.Inthethirdstep,eachDNAprobeisalignedwithallDNAsequencespresentinthechosenreferencenucleotide

W.Tembeetal.

Table1.TypicalvaluesforDNAprobedesignconstraintsKLength(bases)

Fig.2.OutputoftheSuffix-Tree-BasedAlgorithm.

Minimum,Lmin:35Maximum,Lmax:40

Melting-temperature(󰀂C)Minimum,Tmin:70Maximum,Tmax:75

GCcontent(%)45–50

approach,whichwouldrequiresoftwaremodification,istocomparethetargetsequentiallyagainstalistofnon-targetsequences,sothataftereachcomparisononlyunmatchedsequencesarecomparedwiththesubsequentnon-targetgenomesfromthelist.

Oncetheexactmatchesbetweenthetargetgtanditsnear-neighborgrareidentified,thetargetcanberepresentedasacon-catenationgt¼C0M1C1M2C2...MJCJ,asshowninFigure2.MjdenotesthejthexactmatchoflengthjMjjandCjdenotesthejthcandidatesequence,i.e.asequenceinthetargetthatcontainsnomatchesoflengthMorlongerwiththenear-neighbor.C0and/orCJcanbenullbasedonwhetherornotthereisanexactmatchatthebeginningortheendofgt,respectively.ExactmatchesthatarelongerthantheminimumlengthMarediscardedandonlythecandidatesequencesareretainedforfurtherconsideration.Thecandidatesequenceshavenorestrictionwithrespecttotheirlength,positioninthegenome,orcompositionofbasepairs.Ofparticularinteresttoourapplicationisthechoiceofinputparameterstothesuffix-tree-basedalgorithm,inparticular,theminimumlengthMofexactmatchesbetweengtandgrthatwouldleadtoidentificationofallDNAfingerprints.OuranalysisindicatesthattheparameterMiscloselyrelatedtotheexperimentalandspecificityconstraints,detailedinSteps2and3,respectively.Therefore,wefirstdescribetheremainingtwostepsofTOFIbeforeananalyticalrelationisderivedbetweentheparameterMandtheconstraintsimposedbytheproblemdefinition.

ThisselectionofMdiffersfromarelatedstudy(Slezaketal.,2003),whereM¼18washeuristicallyselectedtomeettheminimalPCRprimersize.

Step3:specificitydeterminationbysequencealignment

Inthethirdstep,everyDNAprobeisalignedwithsequencesinthereferencenucleotidedatabase.Theresultsofthealignmentsareinterpretedtopredictcross-hybridizationusingthefollowinggeneralrule:ADNAprobethatalignspoorlywithallnon-targetsDNAsequencesisunlikelytocross-hybridizewithnon-targetsand,therefore,shouldbeconsideredasafingerprint.

DuetothelimitationsofDNA–DNAhybridizationmodels,deter-miningthealignmentcorrespondingtotheoptimalDNA–DNAduplexonamicroarrayishard.Computationally,optimalalignmentbetweentwoDNAsequencescouldbedefinedusingthegeneralizededitdistancealgorithm(Gusfield,1997).Simplyput,theeditdis-tancebetweentwosequencescorrespondstothetotalnumberofinsertions,deletionsandsubstitutionsthatareneededtotransformonesequenceintotheother.FromthestandpointofDNAcross-hybridization,asubstitutioncorrespondstoamismatchedpairofnucleotidesandinsertions/deletionscorrespondtogapsintheDNA–DNAduplex.Thelowerthenumberofmismatchesandgapsinthealignmentisthelowerintheeditdistance.However,editdistancedoesnotprovidesufficientinformationwithregardstothestrengthofhybridization.Forexample,itdoesnotconsiderthepositionofmatchesinthealignment,GCcontent,gaplengthandthelongestcommonfactorinthealignment(Rahmann,2003).Moreover,computingtheoptimalalignmentbetweeneachDNAprobeandeveryDNAsequenceinalargedatabase,suchasthe‘nt’nucleotidedatabasefromtheNationalCenterforBiotechnologyInformation(NCBI,http://www.ncbi.nlm.nih.gov/)(Pruittetal.,2005),wouldbeverycomputationallyintensive.

Basedontheseissuesandpracticaltimeconstraints,weoptedtousetheBLASTNprogramfromBLAST(Altschuletal.,1990)foraligningDNAprobeswithareferencedatabase.ThealignmentalgorithminBLASTisaheuristics-basedapproachthatstartsoffbyidentifyingawordofexactmatchofagivenlength(wparameter)andproceedsbyextendingitusingdynamicprogrammingtoallowmismatchesandgapsinthealignment.AstatisticalsignificancescoretermedE-valueisusedtodistinguishbetweenpotentiallymeaningfulalignmentsandchancealignments.TheE-valuescorewasusedin(Draghicietal.,2005)toquantifythespecificityofDNAprobes.However,E-valuesaredeterminedbythelengthofthealignment,sizeofthequery,sizeofthetotaldatabaseandseveralotherparametersthatarenotrelatedtotheabilityofaprobetoformacross-hybridwithanon-targetgenome.

InsteadofusingE-valuealonetodetermineprobespecificity,TOFIexaminestheactualalignmentsreportedbyBLASTanddeterminesthespecificityofaprobebytakingintoaccountthenumberofmatches,mismatchesandgapsinthealignmentinde-pendentofitsstatisticalsignificance.ThesespecificityconstraintsUformthebasisfortheempiricalthresholdT,usedinthefollowing

Step2:microarrayprobedesign

ThesecondstepimposesasetofexperimentalconstraintsKtoextractDNAmicroarrayprobesfromthecandidatesequences.Arecentreview(PanjkovichandMelo,2005)indicatesthatforthesameinputDNAsequencesdifferentinsilicoprobedesignmod-elingtools,notsurprisingly,producedifferentsetsofDNAprobes.Toourknowledge,thereisnouniversallyacceptedmodelingmeth-odologyavailabletodaytodesignmicroarrayprobesfromDNAsequences.Often,thesetoolsareusedinaniterative,trial-and-errorfashiontooptimizethequality/numberofoutputDNAprobestosuitetheapplication-specificneeds.

Wehaveselectedaprobedesigntoolthatimplementsamulti-statethermodynamicmodelformelting-temperature(SantaLuciaandHicks,2004).ThemodelallowsfortherepresentationofseveraldozenconstraintsontheDNAprobes,suchasprobelength,GCcontent,molarconcentrations,self-hybridizationpossibilitiesandlimitonthenumberofsinglenucleotiderepeats.AdditionalinformationontheconstraintsKcanbefoundinSantaLuciaandHicks(2004).Asanexample,onlyafewimportantconstraintsareshowninTable1.TheDNAprobessatisfyingtheseconstraintsareextractedfromeverycandidatesequenceandarepassedontothenextstep.8

Oligonucleotidefingerprintidentification

Table2.AlgorithmforcandidatesequenceselectionusingthesuffixtreeoutputVariables:

Procedure:

Input:Lmax,Lmin,T,gtandgrgt¼Targetgenome

gr¼Near-neighborgenomeofgt(1)Pcand¼emptysetLmin¼Minimumprobelength(2)LetE¼LmaxÀTLmax¼Maximumprobelength(3)LetM¼E+1¼LmaxÀT+1T¼Thespecificityconstraint,i.e.thecombinedminimum(4)Usingsuffixtree,identifyexactmatchesM1,...,MJoflengthnumberofmismatches,insertionsanddeletionsintheoptimalatleastMbetweengtandgr,andcandidatesC0,C1,...,CJfromgtalignmentbetweenaprobeandanynon-targetsequence,definingafingerprint(5)ForeachcandidatesequenceCjfromthesuffixtreeoutput,M¼Inputparameterofthesuffix-tree-basedalgorithmthatspecifies(A)IfCjisaprefixofgt,thenextendCjbyEbasestotheminimumlengthofexactmatchesbetweengtandgrtheright.GotostepD

(B)IfCjisasuffixofgt,thenextendCjbyEbasestoMj¼jthexactmatchbetweengtandgr,wherej¼1,...,J

Cj¼jthcandidatesequence,j¼0,1,...,J,definedasasubsequencetheleft.GotostepD

(C)ExtendCjontheleftandtherightbyEbasesofgtthatis:

boundedonbothsidesbyexactmatchesoflengthatleastM,or(D)AddallsubsequencesofCjsatisfyingthelengthlocatedbetweenthestartofgtandthefirstexactmatchoflengthatleastconstraintstoPcandM,orlocatedbetweenthelastexactmatchoflengthatleastResult:EverycandidateDNAprobeofgtsatisfyingconstraintsMandtheendofthegenomegtLmax,Lmin,andTwillbeinPcandE¼Lengthoftheextensionofcandidatesequencesintotheadjacentexactmatch(es)

hypothesistoinferhybridizationpatternsfrom‘optimal’BLASTalignments:IfthebestBLASTalignmentbetweenaDNAprobeandanon-targetgenomehasmorethanTmismatchesorgaps,thentheDNAprobewillbeconsideredasaninsilicoDNAfingerprint.ItiswelldocumentedthatusingBLASTforassessmentofcross-hybridizationofaprobewithnon-targetgenomeswillresultinsomenon-specificprobes(Rahmann,2003;Nordberg,2005).Ifawordoflengthwisnotfoundinadatabasesequence,theprobealignmentwiththesequencewillbeskippedresultinginpotentialmissedcross-hybridization.Inothersituations,partialalignmentswithprobesmayresultinunderestimatedcross-hybridization.Twopromisingapproachescouldbeconsideredtoimproveprobespe-cificity.First,additionalfilteringoftheprobesselectedasfinger-printsbyBLASTcouldbeperformedtoaugmentthehypothesisrelatingalignmenttohybridization.Inthiscase,additionalinforma-tionwouldbeextractedfromanalysesofthealignmentsreportedbyBLAST,suchasthemaximumnumberofcontiguousmatchesorthepositionofmatchesintheprobealignment,andusedasrulestoimprovespecificityconstraints.Second,betteralignmentalgo-rithmscouldbeimplementedasapost-processingstep,whichwouldincorporatehybridizationthermodynamicsintothealign-mentevaluationtotakeintoaccounthybridizationstability(Leberetal.,2005).However,itmustbeemphasizedthatthelackofanaccuratemodeltodirectlyrelateaDNAsequencealign-mentwithitscorrespondingDNA–DNAhybridizationleavesthechoiceofprobespecificitycharacterizationasanopenquestion.

TOFIparameterselection

ThethreestepsinTOFIimplementdifferentbioinformaticsalgo-rithms,eachcarryingoutadifferenttaskusingitsownsetofinputparameters.However,theminimumlengthoftheexactmatchesMinthefirststepisanalyticallyrelatedtothelengthconstraintsLminandLmaxontheDNAprobesinthesecondstepandthespecificitythresholdTusedinthethirdstep.Inthissection,wemathematicallyderivethisanalyticalrelationship.

TheproblemofselectinganappropriateMvaluecouldbestatedasfollows:giventhelengthconstraintsLminandLmaxonthelengthoftheprobesandthespecificitythresholdT,findarelationshipbetweenLmin,Lmax,TandM,whichguaranteesthatnovalidDNAfingerprintsarediscarded.

OurapproachisinitiatedbyextendingeachcandidatesequenceCj,j¼0,1,...,J,byEbasesintoeachsideoftheneighboringexactmatch.ThispreventsthepossiblediscardingofsignaturesthatincludetheboundariesofCj.FromtheextendedcandidatesequencesweconstructacandidateDNAprobesetPcand,whichcontainseverysequencesatisfyingthelengthconstraintsLminandLmax.OnlythoseDNAprobesinPcandthatsatisfytheexperimentalconstraintswillbeincludedintheprobesetPforalignmentwiththereferencedataset,i.e.P󰀁Pcand.

WechooseEsuchthatM>E.Thisconditionguaranteesthattheoverlapsbetweentwoadjacentextendedcandidates,ifany,willbelimitedtotheexactmatchregionseparatingthetwocandidates.ItalsosetsalowerlimitonM,M¼E+1.ToensurethatanycandidateDNAprobeoflengthLihavinglessthanorequaltoLiÀTexactmatchesisnotdiscarded,theextensionlengthshouldbeE¼LiÀT.Thus,theextensionlengthEisconstrainedbyLminÀT E LmaxÀT.SubstitutingforM¼E+1andmakingaconservativeselection,weobtainM¼LmaxÀT+1.Suchselectionwill,mostlikely,generateacandidateprobesetPcandthatcontainssomefalsepositives,i.e.probesthatdonotsatisfythespecificityconstraints.Themajorityofsuchnon-specificprobeswillbediscardedaftertheBLASTalignmentinspection.However,somefalsepositiveswillremainduetopossiblemissedmatchesinBLAST,asdescribedintheprevioussection.ThedetailsofthecandidateprobeselectionalgorithmaregiveninTable2.

Finally,weprovethatifthecandidatesequencesareobtainedusingE¼LmaxÀT,withM¼E+1,thenallDNAfingerprintsisincludedinthesetPcand.However,asSdenotesthesetofDNAfingerprintsforgt,itwillsufficetoprovethatsuchselectionforMguaranteesthatS󰀁Pcand.

9

W.Tembeetal.

Assertion1.Byconstruction,everysubsequenceofgtcontaininganexactmatchoflengthsmallerthanMisincludedinanextendedcandidatesequence(steps5.A–5.CinTable2).

Assertion2.Fromeachextendedcandidatesequence,everysub-sequencesatisfyingthelengthconstraintsLminandLmaxisincludedinPcand(step5.DinTable2).

Assertion3.Fromassertions1and2,noneofthesequencesinPcandcancontainanexactmatchoflengthMorgreater.

Assertion4.Bydefinition,aDNAfingerprintsi2ScontainsatmostLiÀTexactmatcheswhenalignedwithanynon-targetgenome.Thus,thelengthofthelongestexactmatchbetweensiandanyothernon-targetgenomeisLiÀT.Since,LiÀT LmaxÀT¼ERESULTS

Softwareimplementation

WeusedMUMmer(Kurtzetal.,2004),anopensourcesoftwarethatimplementsasuffix-tree-basedalgorithmandprovidesseveraloptionsforcomparinggenomicsequences.ThemicroarrayDNAprobedesignfromcandidatesequenceswascarriedoutusingthecommercialsoftwareoligonucleotidemodelingplatform(OMP)(availableathttp://www.dnasoftware.com),whichimplementsastate-of-the-arthybridizationmodel(SantaLuciaandHicks,2004).TheBLASTNprogramfromNCBI_BLAST(version2.2.10)wasusedforaligningmorethan2.0millionnucleotidesequencesstoredinthe‘nt’nucleotidedatabaseattheNCBI.Thedatabasehasgrownsignificantlysinceweobtainedtheresultsdescribedinthispaperandwehavedownloadedthelatestversion,containingmorethan3.6millionsequences,forfuturerunsofTOFI.TheentiresoftwarepipelinewasinitiallyimplementedonaHPCenvironmentattheAdvancedBiomedicalComputingCenter(http://www-fbsc.ncifcrf.gov/)usingtheHighThroughputCom-putingsupportfromSGIÒonanAltixclusterconsistingof64·1.5GHzItanium2processorsrunningRedHatÒLinuxwith64GBofsharedmemory.Oncecandidatesequencesareobtained,TOFItakesadvantageoftheparallelprogrammingoppor-tunitiesonHPCresources.TheDNAprobedesignusingOMPhasbeenparallelizedusingOpenMPbyschedulingDNAprobedesignforeachcandidatesequenceonaseparateprocessor.TheexecutionofBLASTN,byfarthemostcomputationallyintensivepartofTOFI,isparallelizedbyassigningbatchesofDNAprobestosepa-rateprocessors.Inaddition,severalapplication-specificsoftwaremodulestoprocessDNAsequences,tocompileresultsoftheinter-mediatestagesforanalysis,andtoprocessoutputsofvariousstageswereimplemented.ThischoiceofresourcesandsoftwareisjustonewaytoimplementTOFI’sintegratedapproachshowninFigure1.Adifferentchoiceofsoftwareforsuffixtree,DNAprobedesignandsequencealignmentscouldbeusedaswell.How-ever,ourparticularchoicerepresents,arguably,someofthebesttoolsavailableforeachofthethreesteps.

TOFIhassincebeenportedontooneoftheU.S.DepartmentofDefenseMajorSharedResourceCenter’sLinuxclusters,consistingof128dualprocessornodesonadistributedmemorysystem,wheredeploymentofmpiBLAST(Darlingetal.,2003)andexecutionofOMPonseparateprocessorsisbeingtested.Inthecurrentclusterimplementation,weusempiBLASTwith32processorsrunninginparallel,which,again,consumesthebulkofthecomputingtime.10

Fig.3.IdentificationofY.pestisDNAFingerprintsUsingTOFIona32-CPULinuxCluster.

ThecomputationaltimeofthealgorithmdependsonthenumberofprobesgeneratedastheoutputofStep2andprovidedtompi-BLASTandonthesizeofthereferencedatabase.Thenumberofprobes,inturn,dependsonthelengthofthetargetgenome,theavailabilityandsimilarityofanear-neighborgenome,andtheselectedprobedesignconstraints.Thereferencedatabaseisseg-mentedaccordingtorulesofthumbsuggestedbythempiBLASTdevelopers(Darlingetal.,2003),wherethenumberofdatabasesegmentsissettothenumberofprocessors.Hence,thecomputa-tionaltimeofprocessingthereferencedatabaseisdirectlydepen-dentonthespeedupachievedbympiBLAST(Darlingetal.,2003).Theexecutiontimecouldbeimprovedbyusingotherparallelver-sionsofBLAST,suchaspioBLAST(Linetal.,2005).

Casestudy:DNAfingerprintsforY.pestis

TOFIwasusedtoidentifyDNAfingerprintsfortheplague-causingpathogenY.pestisstrainCO92(accessionno.NC_003143.1).Basedontheliterature(Chainetal.,2004),acloselyrelatedorganism,YersiniapseudotuberculosisstrainIP32953(accessionno.NC_006155.1),wasselectedasthenear-neighborgenome.

TheempiricalthresholdT¼15wasselectedbasedonapriorianalysisofhybridizationpropertiesfortheselectedmicroarraytechnologyandexperimentalsetup,suchastheprobelengthandrequiredmelting-temperature.Thisparametercanbeadjustedonacase-by-casebasisusingfeedbackfromadditionalexperimentalevaluations.ForT¼15andmaximumprobelengthLmax¼40,accordingtostep3inTable2,theminimumlengthofexactmatchesinMUMmerisM¼26.

Approximately96%oftheY.pestisgenomewasdiscardedusingMUMmerinthefirststep(Fig.3).Thustheideaofusinganear-neighborgenometoidentifyanddiscardexactmatchesprovedtobeextremelyeffectiveinthiscase.Outofabout4.6millionbasesofY.pestis,fewerthan200000bases,distributedunevenlyamong2222candidatesequences,wereconsideredfurther.Inthenextstep,slightlyover13600DNAprobessatisfyingtheexperimentalconstraintswereextractedfromthecandidatesequences.

IntheBLASTprobespecificityevaluation,theseedsizew¼7wasselectedbecauseitwasthesmallestvalueavailableinthe

Oligonucleotidefingerprintidentification

BLASTversionthatweused,andalargeE-value¼100reducedthepossibilityofmissinghighscoringalignments.

BasedonthespecifiedTOFIparameters,allbut146DNAprobeswererejected,definingtheinsilicoDNAfingerprintsthatwereselectedforexperimentalevaluationusingcustomDNAmicroar-rays.TheseDNAfingerprintsunderwentfurtherscreeningbasedonadditionalexperimentalconstraints,suchasthepresenceofrestrictionenzymecleavagesites,leavingonly99insilicofinger-printsfortesting.

Experimentalevaluationofinsilicofingerprints

TencustomizedDNAmicroarraychips,eachcontainingseveralreplicatesofthe99insilicoDNAfingerprintsandanumberofcontrolsequences,werefabricatedandusedforevaluationpur-poses.SixchipswerehybridizedwiththetargetgenomeY.pestisandfourchipswereusedtotestcross-hybridizationwiththenear-neighborgenomeY.pseudotuberculosis.Normalizeddatawereusedtocomparehybridizationsignals.

Themicroarrayhybridizationdatawereusedtoanalyzethedis-criminatingpoweroftheinsilicofingerprintsbycomparingtheexperimentalhybridizationresultsoftheprobeswithY.pestisandY.pseudotuberculosis.Figure4illustratesasamplesetofdatashow-ingthenormalizedresponse(y-axis)asafunctionoftheDNAfingerprints,whicharearrangedindescendingorderofthediffer-encebetweentheirresponseswithY.pestisandY.pseudotuberculo-sis.Variabilityinthehybridizationresponsesinrepeatedexperimentsispresentedbystandarderrorbarsforeachprobe.Outofthe99DNAfingerprintstested,20(datanotshowninFig.4)producedhigheraverageresponseforY.pseudotuberculosisthanthatforthetarget.Thisisduetocomputationalandexperi-mentalreasons.ThecomputationalreasonsrelatetolimitationsofusingBLASTforspecificityevaluation,asdiscussedinSection3,step3.Adetailedpost-experimentalanalysisoftheBLASToutputsindicatesthat12outofthe20probesdonothavereportedalign-mentswithY.pseudotuberculosisinthesignificanthitlist.Fortheremainingeightprobes,contiguousmatchesof20basesormorewereobservedintheBLASTalignmentsbutthecalculatedsumofmismatchesandgapswaslargerthan15,causingtheseprobestobeidentifiedasfingerprints.ThistypeofproblemcouldbeavoidedifathresholdforthemaximumnumberofcontiguousmatchescouldbeexperimentallydeterminedandusedforadditionalfilteringoftheprobesthatpassedthefirstBLASTspecificitytesting.Theexperimentalreasonsrelatetothevariabilityofproberesponses.Althoughallofthe20probeshavelargermeanresponsesforY.pseudotuberculosisthanthatforY.pestis,onlysixoftheseprobeshavesignificantlylargerresponses.Theexperimentalreasonfortheobservedaberranthybridizationofthesesixprobesisnotclear.These20probeswereexcludedfromfurtherevaluation.

Forafingerprinttobeusefulinadiagnosticassay,itshouldyieldaverylowresponsefornon-targetsandahighresponseforthetarget.Thus,afewDNAfingerprintsinFigure4thathaveagooddiscriminatorypowerbuthavearelativelyhighresponsefornon-targetswouldnotbeconsideredusefulondiagnosticassays.ThedatausedinFigure4canalsobeusedtoidentifyvalidfingerprintsbasedonalternaterules,suchasidentifyingquantifiablethresholdvaluesfortargetandnon-targetresponses.Forexample,25probescouldbeselectedbyusingaminimumthresholdvalueof2.0forY.pestisresponsesandamaximumthresholdvalueof1.0forY.pseudotuberculosisresponses,while20probescouldbeselected

Fig.4.ComparisonofHybridizationofinsilicoFingerprintswithTarget(Y.pestis)andNon-target(Y.pseudotuberculosis).

usingaminimumthresholdof2.0forY.pestisandallowingamaxi-mumthresholdof0.5forY.pseudotuberculosis.Ineachcase,asufficientlylargenumberofprobeswouldallowfordetectionredundancy.

OtherapplicationsofTOFI

Havinganear-neighborgenomeisnotarequirementforTOFI.Thenear-neighborgenomeisusedtoreducethesolutionsearchspaceasmuchaspossibleinthefirststep,whichiscomputationallytheleastexpensivestep.Thetargetgenomecouldbecomparedwithanysmallsetofgenomesusingsuffixtrees.Thehigherthenumberofmatchesidentifiedinthefirststepis,theloweristhenumberofcomputationsrequiredinthesubsequentsteps.Inthecurrentstudy,asinglenear-neighborcomparisonreducedthesearchspaceveryeffectively.Inthecaseinwhichacloselyrelatednear-neighborforthetargetisunknown,eitherarbitrarygenome(s)couldbeusedasnear-neighbor(s)orthefirststepcouldbeomitted.Infact,TOFIwassuccessfullyusedtoidentifyDNAfingerprintsforplasmidspPCP1,pCD1andpMT1inY.pestiswithoutusinganynear-neighbor.Becauseplasmidsaremuchshorter(aboutafewthousandbases)thanbacterialgenomes(typicallyoverafewmillionbases),thewholeplasmidcouldbeconsideredasasinglecandidatesequenceandsentdirectlyasinputtotheDNAprobedesignstep.

TOFIwasalsousedtoidentifyfingerprintsofFrancisellatularensisstrainSCHUS4(accessionno.NC_006570.1).Thegenomicsequenceofthenear-neighborFrancisellaphilomiragiawasnotavailableand,therefore,thefirststepofTOFIwasomitted.Inthesecondstep,weusedOMPtoscanthewholeF.tularensisgenome,consistingofabout1.9millionbases.Overlapofadjacentprobeswaslimitedto10basestoreducecomputationtime.OMPidentifiedabout20000probes,whichweretestedforspecificitywithT¼15,resultingin250fingerprints.Furtherscreeningforrestrictionenzymecleavagesitesreducedthenumberofinsilicofingerprintsto121.

Fourchipswerefabricatedusingseveralreplicasofthe121insilicofingerprintsandanumberofcontrolprobes.TwochipswereusedtotesthybridizationwithF.tularensisandtheothertwototestcross-hybridizationwithF.philomiragia.Weperformed

11

W.Tembeetal.

initialevaluationusingacriterionsimilartotheoneemployedforY.pestis.IncontrasttotheY.pestishybridizationexperiments,onlyoneprobehadhigheraverageresponsewithF.philomiragiathanwithF.tularensis.Intheexperiment,85probesshowedanormalizedresponsewithF.philomiragiasmallerthan1.0,while81ofthoseprobeshadresponseslargerthan2.0withF.tularensis.Currently,alargenumberofadditionalexperiments,includingastandardpanelofnon-targetgenomes,arebeingperformedtoevalu-atethefingerprintsofY.pestisandF.tularensisbeforetheyareusedasprobesindiagnosticassays.

Currentlimitationsandplansforimprovements

TOFIhasalreadybeenusedinitscurrentconfigurationtoidentifyfingerprintsforanumberofpathogens.However,severalalgo-rithmicandimplementationissuesaffectitsperformanceandarebeingaddressed.

ThescopeofDNAfingerprintidentificationinTOFIiscurrentlylimitedtoasingletargetsequence.Weareinvestigatingapproachestoselectfingerprintscommontoalargenumberofrelatedtargets,whichwouldallowfortheidentificationoffingerprintscommontospecificspeciesorgenus.ForhighlyvariableRNAviruses,uniquefingerprintsmaynotexist.Forthisapplication,anapproachbasedontheselectionofnon-uniqueprobes,whichtogethermayformuniquehybridizationpatternsonachipforunambiguousviralidentification,isalsobeingconsidered.

AlthoughwehaveprovidedadefinitionofaDNAfingerprintandanalgorithmthatguaranteesallfingerprintssatisfyingitareiden-tified,validprobescouldpotentiallybediscardedduetosignificantoverlapwithadjacentprobesduringtheprobedesignphase(Step2ofTOFI).Thisrelatestopracticalconsiderationsinordertoreducethenumberof‘similar’probesonthechip,allowspaceformultiplereplicates,andlimitthetotalnumberofprobes,consideringthatseveralcontrolsequencesneedtobepresentonthemicroarray.ExperimentalevaluationsoftheidentifiedinsilicofingerprintsforY.pestisandF.tularensisindicatethepossibilityforimprovementinthealgorithmspecificity.Itwasfoundthatcross-hybridizationwithnon-targetgenomeswasnotdetectedbyBLASTinabout10%oftheinsilicofingerprintsforY.pestis,whileinanadditional10%ofthefingerprintsthecross-hybridizationwasunderestimated.Specificitywillbeimprovedinthefuturebasedon:(1)theselectionofoptimalTOFIparametersusingcomprehensiveevaluationoftheexperimentalresults;(2)thepost-processingofBLASTalign-mentsusingexpertrulestobettercorrelatealignmentswithhybrid-izationand(3)thedevelopmentofoptimalalignmentalgorithmsthatincludehybridizationthermodynamicsasapost-processingstepaftertheBLASTspecificityevaluation.

Duetotheextremelyrapidgrowth/modificationsinavailableDNAsequences,thecontinuedvalidityoftheDNAfingerprintsmustbefrequentlyverified.Effortstoautomaticallyupdatefinger-printsarealsoplanned.

definitionofaDNAfingerprintisprovided.Moreimportantly,giventhedesiredlengthofafingerprintanditsrequirednumberofnon-matchingbasepairs,weprovideanalgorithmthatguaran-teesthatallinsilicofingerprintsareidentified.Fingerprintsforanumberofpathogenicsequenceshavebeenpreliminarilyevaluatedthroughexperimentaltestswithpathogensofinterestandnon-targetgenomes.InitialresultsindicatethattheapproachiscapableofidentifyingmultiplefingerprintsforspecificDNAsequences,andthatthealgorithmcouldbeimprovedtoenhancespecificity.Furthertesting,withastandardpanelofnon-targetgenomes,isunderway.ThisinformationwillenableoptimalTOFIparameterselectionandwillserveasavaluablebenchmarkforfuturealgorithmimprovements.

DISCLAIMER

TheopinionsorassertionscontainedhereinaretheprivateviewsoftheauthorsandarenottobeconstruedasofficialorasreflectingtheviewsoftheU.S.ArmyortheU.S.DepartmentofDefense.

ACKNOWLEDGEMENTS

Theauthorswishtoexpresstheirgratitudetotheanonymousrefereesforusefulcommentsandsuggestionsaswellasinterestingideasforfuturework.TheauthorsthankKamalKumaroftheBiotechnologyHPCSoftwareApplicationsInstituteforhelpindevelopingTOFI’sgraphicaluserinterfaceandBobStephens,JackCollinsandKarolMiaskiewiczoftheAdvancedBiomedicalComputingCenter,NationalCancerInstitute,Frederick,MD,forthecomputationalsupport.ThisworkwassponsoredbytheU.S.DepartmentofDefenseHigh-PerformanceComputingModernizationProgram(HPCMP),undertheHigh-PerformanceComputingSoftwareApplicationsInstitutes(HSAI)initiative,andtheU.S.DefenseThreatReductionAgency.ConflictofInterest:nonedeclared.

REFERENCES

Abee,T.etal.(2004)Impactofgenomicsonmicrobialfoodsafety.TrendsBiotechnol.,

22,653–660.

Altschul,S.F.etal.(1990)Basiclocalalignmentsearchtool.J.Mol.Biol.,215,

403–410.

Benson,D.A.etal.(2005)GenBank.NucleicAcidsRes.,34,D16–D20.

Chain,P.S.etal.(2004)InsightsintotheevolutionofYersiniapestisthroughwhole-genomecomparisonwithYersiniapseudotuberculosis.Proc.NatlAcad.Sci.USA,101,13826–13831.

Darling,A.etal.(2003)Thedesign,implementation,andevaluationofmpiBLAST.

In4thInternationalConferenceonLinuxClusters:TheHPCRevolution2003,inconjunctionwiththeClusterWorldConference&Expo,SanJose,CA.

Draghici,S.etal.(2005)Identificationofgenomicsignaturesforthedesignofassaysfor

thedetectionandmonitoringofanthraxthreats.InAltman,R.B.etal.(ed.),Proceed-ingsofthePacificSymposiumofBiocomputing2005Hawaii,USA,pp.248–259.Gardner,S.etal.(2004)Sequencingneedsforviraldiagnostics.J.Clin.Microbiol.,42,

5472–5476.

Gordon,P.M.K.andSensen,C.W.(2004)Osprey:acomprehensivetoolemploying

novelmethodsforthedesignofoligonucleotidesforDNAsequencingandmicroar-rays.NucleicAcidsRes.,32,e133.

Gusfield,D.(1997)AlgorithmsonStrings,Trees,andSequences:ComputerScience

andComputationalBiology.CambridgeUniversityPress,Cambridge,UK.

Hardiman,G.(ed.)(2003)MicroarrayMethodsandApplications-NutsandBolts.DNA

Press,Eagleville,PA,USA.

Hass,S.A.etal.(2003)Genome-scaledesignofPCRprimersandlongoligomersfor

DNAmicroarrays.NucleicAcidsRes.,31,5576–5581.

Ivnitski,D.etal.(2003)Nucleicacidapproachesfordetectionandidentificationof

biologicalwarfareandinfectiousdiseaseagents.Biotechniques,35,862–869.

CONCLUSIONS

ThisworkpresentedTOFI,anintegratedbioinformaticstooltoidentifyinsilicogenomicfingerprintsforthedesignofmicroarraydiagnosticassays.TOFIisastandaloneapplicationthatexploitstheparallelprogrammingbenefitsprovidedbyHPCplatformsandallowsuserstoselectinputparametersthroughagraphicaluserinterface.Thisworkdiffersfrompreviousonesinthataformal12

Oligonucleotidefingerprintidentification

Joos,T.andFortina,P.(2005)Microarraysinclinicaldiagnosis.HumanaPress,

Totowa,NJ,USA.

Kaderali,L.andSchliep,A.(2002)Selectingsignatureoligonucleotidestoidentify

organismsusingDNAarrays.Bioinformatics,18,1340–1349.

Kurtz,S.(2002)constructionandapplicationofvirtualsuffixtrees..PhDdissertation,

¨en,UniversitatBielefeld,Bielefeld,Germany.TechnischeFakulto

Kurtz,S.etal.(2004)Versatileandopensoftwareforcomparinglargegenomes.BMC

GenomeBiol.,5,R12.

Leber,M.etal.(2005)AfractionalprogrammingapproachtoefficientDNAmelting

temperaturecalculation.Bioinformatics,21,2375–2382.

Lin,H.etal.(2005)EfficientdataaccessforparallelBLAST.IEEEInternational

ParallelandDistributedProcessingSymposium,Denver,CO.

Nordberg,E.(2005)YODA:selectingsignatureoligonucleotides.Bioinformatics,21,

1365–1370.

Panjkovich,A.andMelo,F.(2005)Comparisonofdifferentmeltingtemperature

calculationmethodsforshortDNAsequences.Bioinformatics,21,711–722.Pruitt,K.D.etal.(2005)NCBIReferenceSequence(RefSeq):acuratednon-redundant

sequencedatabaseofgenomes,transcriptsandproteins.NucleicAcidsRes.,33,D501–D504.

Rahmann,S.(2003)Fastlargescaleoligonucleotideselectionusingthelongestcom-monfactorapproach.J.Bioinfo.Compu.Biol.,1,343–361.SantaLucia,J.,JrandHicks,D.(2004)ThethermodynamicsofDNAstructuralmotifs.

Annu.Rev.Biophys.Biomol.Struct.,33,415–440.

Schliep,A.etal.(2003)GrouptestingwithDNAchips:generatingdesignsanddecod-ingexperiments.InProceedingsoftheComputationalSystemsBioinformatics,August11-14,Stanford,CA,pp.84–91.

Sergeev,N.etal.(2006)MicroarrayanalysisofBacilluscereusgroupvirulencefactors.

J.Microbiol.Meth.,65,488–502.

Slezak,T.etal.(2003)Comparativegenomicstoolsappliedtobioterrorismdefense.

Brief.Bioinform.,4,133–149.

Urisman,A.etal.(2005)E-Predict:acomputationalstrategyforspeciesidentification

basedonobservedDNAmicroarrayhybridizationpatterns.BMCGenomeBiol.,6,R78.

Viljoen,G.J.etal.(eds)(2005)MolecularDiagnosticsPCRHandbook.Springer

Publishers,Berlin,Germany.

Wang,D.etal.(2002)Microarray-baseddetectionandgenotypingofviralpathogens.

Proc.NatlAcad.Sci.USA,99,15687–15692.

Weiner,P.(1973)Linearpatternmatchingalgorithms.InProceedingsof14thIEEE

AnnualSymposiumonSwitchingandAutomataTheory,Washington,DC,IEEEComputerSoc.,pp.1–11.

Willse,A.etal.(2004)Quantitativeoligonucleotidemicroarrayfingerprintingof

Salmonellaentericaisolates.NucleicAcidsRes.,32,1848–1856.

13

因篇幅问题不能全部显示,请点此查看更多更全内容