Couldn't automatically detect the sequence identifier field in the fastq id string. #701

Closed
Sebastian-Mynott opened this issue Mar 6, 2019 · 11 comments

@Sebastian-Mynott

Hi,

I'm looking at sequence data downloaded from the NCBI SRA database. When running filterAndTrim I get the following error:

Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  : 
  Couldn't automatically detect the sequence identifier field in the fastq id string.

After looking at the source code I tried inserting a dummy identifier, so instead of the identifier reading @SRR9876543.1 1/1, it would read @M012345:SRR9876543.1 1/1, but this didn't work.

Could you suggest how I can get around this?

Many thanks.

@benjjneb
Owner

benjjneb commented Mar 6, 2019

What is the output of head -n4 mysrr_file.fastq (in the shell)?

What command did you use to convert from sra format to fastq? i.e. the fastq-dump arguments.
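If the download is still gzip-compressed, `head -n4` on it will not print readable text. As a minimal sketch (the helper name `first_record` and the file name in the usage note are hypothetical, not part of dada2 or SRAdb), the first record of either a plain or a gzipped fastq file can be inspected like this:

```python
import gzip
from itertools import islice


def first_record(path):
    """Return the first four lines (one FASTQ record) of a plain or gzipped file."""
    # Assumption: gzip-compressed files are recognised by their ".gz" suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, 4)]
```

For example, `print("\n".join(first_record("mysrr_file.fastq.gz")))` prints the ID line, the sequence, the `+` separator and the quality string of the first read.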

@Sebastian-Mynott
Author

Aha! I downloaded the files using getSRAfile(SRAccessions, sra_con, fileType = 'fastq') from the SRAdb package, which gave me a list of .fastq.gz files, so I didn't think I'd need fastq-dump.

The output of head -n4 mysrr_file.fastq is:
@SRR7758019.1 1/1
GCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCTGAGGACGACCGGTCCGCCCTCTNNNNNNNNNTNNNNCTCGGCNTTGGCATCTTCTTGGGGAACGTNANTGCACTTGACTGTGTGGTGCGGTATCCAGGACTTTTACTTTGAGGNNNNNNNNGTGNNNCAANCNGGCTTACGCCTTGAATACATTAGCATGGAATAATAAGATAGGACCTTGGTTCTATTTNNTTGGNNNNNNNNGCTGAGGTNATGATTACTAGGGATAG
+
CCCCCGGGGGGGGGEGGFGGGGGGGGGGGGGGFGFGGGGFGGGGGGGGGGGGFGDFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#########:####::DFGG#:BFGGGGGGGGGGGGGGGFGGG#:#:BFFGGGGGGFGGGGFGGGGGGGGGGGGGG7FGGGGGGGGGGFGG########56>###66=#6#6*;CFCGFGGGGGGFFGGGGGGGGDFG0776CAF7FF?7+??FGG6CC?C5D?GGGG##228*########0--1<CG4#--(4;A>4-5=FF**9*

Do I need to download the files again as SRA then convert to fastq?

@benjjneb
Owner

benjjneb commented Mar 6, 2019

> Do I need to download the files again as SRA then convert to fastq?

I would at least try that on one file to see if that fixes this issue.

@kelseysumner

Hi, I wanted to re-open this because I am having a similar issue. I'm using paired-end sequence data sequenced on Illumina MiSeq and also downloaded from the NCBI SRA database. I downloaded the files originally as SRA files and then converted them to zipped fastq files (fastq.gz) using fastq-dump with a flag to make sure each sample had separate files for the forward and reverse reads.

I'm getting the same error when I run DADA2 on these files:
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, :
Couldn't automatically detect the sequence identifier field in the fastq id string.
Calls: filterAndTrim ... mclapply -> lapply -> FUN -> .mapply ->
Execution halted

The head of one of the fastq files I'm reading into DADA2 looks like this:
@SRR1191781.12854 12854 length=250
TTATTAATCCTATTGAACTATTTACGACATTAAACACACTGGAACATTTTTCCATTTTACAAATTTTTTTTTCAATATCATTTGCATAATCTAATTGGTCTTTAGGTTTATTAGCAGAGCCAGGTTTTATTCTAACTTGAATACCATTTCCACAAGTTACACTACATGGGGACCATTCAGTTGAAAGAGAATTTTGTATTGTCTTTAAATATTTTTCTATGTGCT
+
HHHHHHHHHHHFHHHHHHHHHHHHHGGFEFFHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH5FGHHHGG>EGHHHHHHHHHHHGHBHHFHHHGDGHHHHHGGHHHHGHHFHHHGHFBFEGHFHH2BFGHGGHHHHHHHGGHHHHHHHHHHHGHHHHG1GHFHHGHHHHHEGGGGHHHHHGGHFHHBGGBCGHHHFHGGHGHFFHHHHHHGHHHHFGGGGGGFFGF

Do you know what might be going on and how I could fix this issue?

@benjjneb
Owner

benjjneb commented Sep 3, 2019

This error occurs because the original fastq ID lines have been replaced by these SRA ID lines, which filterAndTrim(..., matchIDs=TRUE) doesn't recognize.

Do you need to use the matchIDs=TRUE flag? If you don't, just remove it and everything should work fine.
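To make the difference between the two kinds of ID line concrete, here is a rough Python sketch of a colon-counting heuristic. It assumes the detector splits the ID line on whitespace and looks for exactly one field containing 6 colons (newer Illumina style) or 4 colons (older style). That rule is an assumption made for illustration, not a copy of the dada2 source, and the function name `detect_id_field` and the Illumina-style ID in the usage note are hypothetical.

```python
import re


def detect_id_field(id_line, sep=r"\s"):
    """Return the 1-based index of the field that looks like an Illumina read ID, or None.

    Assumed rule: exactly one separator-delimited field has the highest colon
    count, and that count is 6 or 4.
    """
    fields = re.split(sep, id_line.lstrip("@"))
    colons = [field.count(":") for field in fields]
    most = max(colons)
    if most in (6, 4) and colons.count(most) == 1:
        return colons.index(most) + 1
    return None
```

Under this assumed rule, `detect_id_field("@M01234:55:000000000-ABCDE:1:1101:15589:1332 1:N:0:1")` finds the first field, while the SRA-style lines quoted in this thread (`@SRR1191781.12854 12854 length=250`, `@SRR7758019.1 1/1`) and the dummy `@M012345:SRR9876543.1 1/1` contain no field with enough colons, so nothing is detected. filterAndTrim also documents `id.sep` and `id.field` arguments; whether setting `id.field` by hand is enough for SRA-style lines is something to try on a single sample first.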

@kelseysumner

Thank you for the quick reply. It looks like that solved the issue!

@d-callan

I'm having a similar problem with the SRA id lines, except I do require the matchIDs = TRUE flag. What then?

@benjjneb
Owner

@d-callan Unfortunately I'm not sure there is a solution in that case. The original IDs are required to match the paired reads together if they are now in different orders.

@d-callan

Thanks anyhow. I'm not convinced they are truly ordered differently, but I'm finding there are definitely differing read counts for forward and reverse. Perhaps I can quickly put together a script to remove those reads which don't have a partner before passing to dada2 and see where that gets me. Was mostly just hoping I might not have to.
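A script of that kind could look roughly like the Python sketch below. It is not part of dada2 and has not been run on the files discussed here. It assumes four lines per record with no blank lines, that mates share the first whitespace-delimited token of the ID line (for example `SRR1191781.12854`), and that the reverse reads fit in memory. The function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (id_line, sequence, separator, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        id_line = handle.readline().rstrip("\n")
        if not id_line:  # end of file (a blank line also stops the reader)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield id_line, seq, plus, qual


def spot_id(id_line):
    """Assumed pairing key: first whitespace-delimited token, without the leading '@'."""
    return id_line[1:].split()[0]


def keep_paired(fwd_records, rev_records):
    """Drop reads without a mate; return both lists in the order of the forward reads."""
    rev_by_id = {spot_id(rec[0]): rec for rec in rev_records}
    kept_fwd, kept_rev = [], []
    for rec in fwd_records:
        mate = rev_by_id.get(spot_id(rec[0]))
        if mate is not None:
            kept_fwd.append(rec)
            kept_rev.append(mate)
    return kept_fwd, kept_rev


# Tiny in-memory example: read X.2 has no reverse mate and is dropped.
fwd = io.StringIO("@X.1 1 length=4\nACGT\n+\nIIII\n@X.2 2 length=4\nACGT\n+\nIIII\n")
rev = io.StringIO("@X.1 1 length=4\nTTTT\n+\nIIII\n")
paired_fwd, paired_rev = keep_paired(read_fastq(fwd), read_fastq(rev))
```

Because the reverse reads are emitted in the order of their forward mates, the two outputs end up with equal read counts and matching order, which is the situation in which matchIDs=TRUE should no longer be needed.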

@dbro970

dbro970 commented Feb 28, 2023

> thanks anyhow. I'm not convinced they are truly ordered differently. but I'm finding there are definitely differing number of read counts for forward and reverse. perhaps I can put together a script quickly to remove those reads which don't have a partner before passing to dada2 and see where that gets me. was mostly just hoping I might not have to..

Hi, apologies for resurrecting an old thread. I was just wondering if you managed to find a solution to this, as I've found myself in the same situation.

@wygbio

wygbio commented Dec 7, 2023

I am also running into a similar issue with the header of my fastq files. They came from an Illumina MiSeq run, not from the NCBI SRA database. Is there something wrong with the header that prevents it from being detected?
@HWI-D00433:728:HHHKHBCX2:2:1101:8032:2352.1:N:0--D13a_C25.
