Hi, I'm having an issue where the same singularity container run by the same snakemake workflow on the same input file is working on one computer and not on another. I'm not sure what I'm doing wrong here, and I would appreciate some help troubleshooting. Thanks!
System 1 (department computer), not working:
$ singularity --version
singularity version 3.1.0-1
$ uname -a
Linux [hostname] 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/*release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.6"
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Red Hat Enterprise Linux Server release 7.6 (Maipo)
System 2 (my desktop), working:
$ singularity --version
singularity version 3.1.0
$ uname -a
Linux tom 4.18.0-18-generic #19-Ubuntu SMP Tue Apr 2 18:13:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.10
DISTRIB_CODENAME=cosmic
DISTRIB_DESCRIPTION="Ubuntu 18.10"
NAME="Ubuntu"
VERSION="18.10 (Cosmic Cuttlefish)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.10"
VERSION_ID="18.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=cosmic
UBUNTU_CODENAME=cosmic
The following python3 script is supposed to index a fastq file and store the index on disk.
#!/usr/bin/env python3
import Bio
from Bio import SeqIO
import sys
import logging
import sqlite3
import platform
# set up log
logging.basicConfig(
    filename=snakemake.log[0],
    level=logging.DEBUG)
# log environment
logging.debug('sys.version')
logging.debug(sys.version)
logging.debug('sqlite3.version')
logging.debug(sqlite3.version)
logging.debug('platform.python_implementation()')
logging.debug(platform.python_implementation())
logging.debug('platform.platform()')
logging.debug(platform.platform())
logging.debug('Bio.__version__')
logging.debug(Bio.__version__)
read_file = snakemake.input[0]
db_file = snakemake.output[0]
try:
    read_index = SeqIO.index_db(db_file,
                                read_file,
                                'fastq')
except Exception as e:
    logging.exception('')
    raise e
The script works on System 2 but fails on System 1. The input fastq is the same file stored in the same location, although it is accessed across a cifs share for System 2.
Here's the log from system 1:
DEBUG:root:sys.version
DEBUG:root:3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0]
DEBUG:root:sqlite3.version
DEBUG:root:2.6.0
DEBUG:root:platform.python_implementation()
DEBUG:root:CPython
DEBUG:root:platform.platform()
DEBUG:root:Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-Ubuntu-19.04-disco
DEBUG:root:Bio.__version__
DEBUG:root:1.73
ERROR:root:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 732, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/path/to/.snakemake/scripts/tmppgoypljn.index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/dist-packages/Bio/SeqIO/__init__.py", line 1032, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 563, in __init__
self._build_index()
File "/usr/local/lib/python3.7/dist-packages/Bio/File.py", line 738, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
And the log from system 2, which produces the expected output file:
DEBUG:root:sys.version
DEBUG:root:3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0]
DEBUG:root:sqlite3.version
DEBUG:root:2.6.0
DEBUG:root:platform.python_implementation()
DEBUG:root:CPython
DEBUG:root:platform.platform()
DEBUG:root:Linux-4.18.0-18-generic-x86_64-with-Ubuntu-19.04-disco
DEBUG:root:Bio.__version__
DEBUG:root:1.73
Unfortunately the fastq file is 177 GB so it's not going to be possible to share.
The scripts were run in the same singularity container, which is here: https://www.singularity-hub.org/containers/8752 (shub://TomHarrop/singularity-containers:py3.7.3_biopython1.73).
The container was run from Snakemake using the following rule:
rule index_reads_37:
    input:
        'reads/r{r}.fq'
    output:
        'py37/r{r}.idx'
    log:
        'py37/r{r}.log'
    benchmark:
        'py37/r{r}_benchmark.txt'
    singularity:
        'shub://TomHarrop/singularity-containers:py3.7.3_biopython1.73'
    script:
        'src/index_reads.py'
On both machines, the snakemake command was:
snakemake --use-singularity --writable-tmpfs --nv py37/r1.idx
Maybe this is the same as #476? I've tried --no-home, --containall and --cleanenv and it still works on system 2 and fails on system 1.
I'm still getting different results on the two computers despite doing all I can to stop python loading anything from ~/.local.
singularity exec \
    -B /Volumes,${PWD} \
    --nv \
    -H $(mktemp -d) \
    --pwd ${PWD} \
    --containall \
    --cleanenv \
    path/to/container.sif \
    script.py
I'm printing os.environ to the log so I can see where I am in python (I snipped some paths):
DEBUG:root:environ({'LD_LIBRARY_PATH': '/.singularity.d/libs',
'LANG': 'C',
'TZ': 'Pacific/Auckland',
'SINGULARITY_APPNAME': '',
'SINGULARITY_CONTAINER': '/Volumes/[my_path]/projects/racon-chunks/.snakemake/singularity/de2d4896fe4ec4d5d88eda3065dbb926.simg',
'PYTHONNOUSERSITE': '',
'PWD': '/Volumes/[my_path]/projects/racon-chunks',
'HOME': '/Volumes/[tmp_path]/tmp.Vy058BBucG',
'TMPDIR': '/Volumes/[tmp_path]',
'TERM': 'xterm-256color',
'SINGULARITY_NAME': 'de2d4896fe4ec4d5d88eda3065dbb926.simg',
'SHLVL': '1',
'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/racon/build/bin',
'_': '/usr/bin/python'})
Okay, so this is actually getting a sqlite3 error on the RHEL 7 node.
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
Does this program clean out any possible sqlite file that may exist or is it possible that there could already be a file existing on the RHEL 7 node, and it's trying to write more data into it and conflicting?
That error to me reads: I'm trying to write this value in, but the value must be unique across the entire dataset... and it's not.
That is how I understand it too.
To answer your question, the index file does not exist when the job starts on the RHEL node, so it's not writing into an already-created database.
The script is indexing a fastq file by read id. The read IDs have to be unique, because they become the key for retrieving reads from the index. I have checked and triple-checked that they are unique, and the fact that indexing works on the other computer (using the same input file) seems to confirm that.
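For anyone following along, here is a minimal sketch of the failure mode being discussed. It assumes the table layout that appears in the debug output in this thread (offset_data with a key column) and uses made-up rows. It is not the Biopython code itself, only an illustration that the uniqueness check happens when the index is created, not when the rows are inserted.

```python
import sqlite3

# In-memory database; rows are invented for illustration only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE offset_data "
    "(key TEXT, file_number INTEGER, offset INTEGER, length INTEGER)"
)

# No constraint exists yet, so a repeated key is accepted at insert time.
rows = [("read1", 0, 0, 100), ("read2", 0, 100, 100), ("read1", 0, 200, 100)]
con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)

# The uniqueness check only runs here, when the unique index is built.
try:
    con.execute("CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key)")
    index_error = None
except sqlite3.IntegrityError as err:
    index_error = err

print("index creation failed:", index_error)
con.close()
```

So the error says that, at index time, the table held at least one key more than once. It does not say how the repeated row got there.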
Could it be something to do with how the different systems are handling temp files?
Could it be something to do with how the different systems are handling temp files?
Shouldn't be... But just in case: is /tmp a tmpfs on one system and on disk on the other?
Is there a difference on the storage where it's either writing out the sqlite3 file, or reading the input from, between the two machines?
That, or for some reason ... it's actually processing a dataset twice. Can you log anything before it goes into the sqlite phase to log what key it's working on, and see if you get a duplicate?
RHEL:
scratch 204T 193T 12T 95% /Volumes/scratch
My desktop:
/dev/sda2 961G 107M 912G 1% /tmp
That, or for some reason ... it's actually processing a dataset twice. Can you log anything before it goes into the sqlite phase to log what key it's working on, and see if you get a duplicate?
I'll see what I can work out for this.
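One way to check the input independently of Biopython and sqlite is to count the read IDs directly. The sketch below assumes strict 4-line FASTQ records and takes the ID to be the header text up to the first whitespace, which is an assumption about how the index keys are derived. The function name and the example records are hypothetical.

```python
import collections
import itertools


def duplicated_read_ids(fastq_lines):
    """Return {read_id: count} for read IDs seen more than once.

    Assumes strict 4-line FASTQ records (header, sequence, '+', quality).
    """
    # Every 4th line, starting at the first, is a header line.
    headers = itertools.islice(fastq_lines, 0, None, 4)
    # Drop the leading '@' and keep the text up to the first whitespace.
    ids = (h[1:].split(None, 1)[0] for h in headers if h.strip())
    counts = collections.Counter(ids)
    return {rid: n for rid, n in counts.items() if n > 1}


# Invented records, only to show the call.
example = [
    "@read1 extra\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "ACGT\n", "+\n", "IIII\n",
    "@read1\n", "TTTT\n", "+\n", "IIII\n",
]
print(duplicated_read_ids(example))
```

An open file handle can be passed in place of the list. Note that the counter keeps every ID in memory, so for an input with hundreds of millions of reads this would need a lot of RAM or a subset of the file.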
Thanks for all the help on Slack. Here is the reproducible example as discussed with @vsoch
$ singularity pull \
    --name biopython-index-test.sif \
    shub://TomHarrop/biopython-index-test:biopython_index_test
# requires SINGULARITYENV_TMPDIR and
# SINGULARITYENV_PYTHONNOUSERSITE to be set
$ singularity run \
    -B ${PWD},${TMPDIR} \
    --nv \
    -H $(mktemp -d) \
    --pwd ${PWD} \
    --containall \
    --cleanenv \
    biopython-index-test.sif
This works on my desktop and fails on the RHEL system.
Recipe, script and data file are all at https://github.com/TomHarrop/biopython-index-test
Meanwhile, the sysadmins gave me the following info about the /Volumes mount on the RHEL system:
They will be seen as direct mounts via cvfs. The mount is a combination of ethernet and FC where the metadata controllers over ethernet tell the system where to go to find the blocks for the file but all the data travels direct over fibre channel. The end result should be that the system just sees it like it would a local mounted drive.
hey wanted to let you know I'm actively working on this! I don't have a rhel system, but I'll try to create a debugging container for you.
hey @TomHarrop happy birthday! :) I've created a debugging container recipe and install of biopython that should help to start debugging what is going on. You can read about my process here https://github.com/researchapps/biopython-index-test and specifically, the section that will let you build and use the container is here https://github.com/researchapps/biopython-index-test#back-to-singularity.
TLDR:
Report back! :)
Thanks for the awesome birthday present :+1:
When I run that on the RHEL host, everything looks fine up to the EOF bit, then this happens:
EOF
END of of FastqRandomAccess __iter__
Inserting batch of 1 offsets, K00171:456:HKGMHBBXX:5:2228:31517:49247 ... K00171:456:HKGMHBBXX:5:2228:31517:49247
length of random_access_proxies is less than max_open 10
About to index 3066301 entries
CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key);
We triggered the error, closing connection and proxies.
ERROR:root:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 870, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 1073, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 609, in __init__
self._build_index()
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 880, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 870, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/index_reads.py", line 41, in <module>
raise e
File "/index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 1073, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 609, in __init__
self._build_index()
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 880, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
And of course it works fine on my desktop - here's the output anyway.
EOF
END of of FastqRandomAccess __iter__
Inserting batch of 1 offsets, K00171:456:HKGMHBBXX:5:2228:31517:49247 ... K00171:456:HKGMHBBXX:5:2228:31517:49247
length of random_access_proxies is less than max_open 10
About to index 3066301 entries
CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key);
END of call to Bio.File._build_index
END of call to Bio.File._SQLiteManySeqFilesDict
Haha, we nailed it! So it looks like the issue is with this final index - do you understand why it's done?
About to index 3066301 entries
CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key);
We triggered the error, closing connection and proxies.
ERROR:root:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 870, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
The snippet is here: https://github.com/researchapps/biopython-index-test/blob/master/biopython-1.73-custom/Bio/File.py#L866. Is this a required step, or something done to speed things up? It seems like what we would want to do is verify the file we are connecting to, and then find which key is causing the violation. I wonder if it might be one of the "manually created" ones I saw in the code. I saw in the issue you created on the biopython board they suggested creating the index first, so minimally we could move it earlier to trigger the error sooner.
A few things I'm curious about:
What would be useful to do is to save the database file when it's created (and/or at other timepoints) under both the working and non-working conditions, and then see if there are any differences. We can also write a loop that would identify the key that is repeated and see if that adds insight.
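A rough sketch of that second idea, assuming a copy of the index file survives the failed run and has the offset_data table shown in the debug output. The function name find_duplicate_keys and the demo rows are hypothetical; the demo only builds a tiny throwaway database so the query can be tried out.

```python
import os
import sqlite3
import tempfile


def find_duplicate_keys(db_path, limit=20):
    """Return up to `limit` (key, count) rows for keys stored more than once."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT key, COUNT(*) FROM offset_data "
            "GROUP BY key HAVING COUNT(*) > 1 "
            "ORDER BY COUNT(*) DESC, key LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        con.close()


# Throwaway database with invented rows, only to exercise the query.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.idx")
    con = sqlite3.connect(demo_db)
    con.execute(
        "CREATE TABLE offset_data "
        "(key TEXT, file_number INTEGER, offset INTEGER, length INTEGER)"
    )
    con.executemany(
        "INSERT INTO offset_data VALUES (?, ?, ?, ?)",
        [("a", 0, 0, 10), ("b", 0, 10, 10), ("a", 0, 20, 10)],
    )
    con.commit()
    con.close()
    duplicates = find_duplicate_keys(demo_db)

print(duplicates)
```

Running the same query against the index saved on each machine would show whether the RHEL database really contains a repeated key, and which one.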
Databases do some... interesting operations on filesystems. Can you move the database off of the CVFS filesystem to a local disk on the RHEL system to see if you still get the error?
An example (guessing on versions, but should be close): on Lustre 1.5.x, sqlite DBs worked fine; on Lustre 1.7.x they were broken. The way Lustre was handling writes clashed with how sqlite was doing its writes. Moving the DB creation from Lustre to /tmp solved the issue.
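To confirm which filesystem a given path actually lives on (inside or outside the container), something like the following could be used on Linux. It is a sketch that parses text in /proc/mounts format. The function name is hypothetical, the sample mount table is invented for illustration, and escaped characters in mount points are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) of the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format (Linux only).
    """
    path = os.path.abspath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; on ties the later entry wins.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


# Invented mount table, not taken from either machine in this thread.
sample = """\
/dev/sda2 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""
print(filesystem_type("/tmp/work", sample))
print(filesystem_type("/home/user", sample))
```

On a real system the second argument would be the content of /proc/mounts, read with open("/proc/mounts").read().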
I just tested it on a local disk, and got the same error.
EOF
END of of FastqRandomAccess __iter__
Inserting batch of 1 offsets, K00171:456:HKGMHBBXX:5:2228:31517:49247 ... K00171:456:HKGMHBBXX:5:2228:31517:49247
length of random_access_proxies is less than max_open 10
About to index 3066301 entries
CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key);
We triggered the error, closing connection and proxies.
ERROR:root:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 870, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 1073, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 609, in __init__
self._build_index()
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 880, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 870, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/index_reads.py", line 41, in <module>
raise e
File "/index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 1073, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 609, in __init__
self._build_index()
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 880, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
hey @TomHarrop I just updated the debug build so that:
When you get a chance, would you like to test?
If the PRAGMA settings don't make a difference, I asked a few questions for discussion in my last post (about Snakemake, etc.) Let me know your thoughts on those. Also - isn't it kind of strange that the exception for the index was triggered by another exception (the same one?). It makes me wonder if something racelike is going on.
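For reference, a small sketch of how the PRAGMA values in effect on a connection can be read back with Python's sqlite3 module, to confirm that a setting really was applied. The values set here are chosen only for illustration and are not necessarily the ones used by Biopython or by the debug build.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # File-backed database, since in-memory databases behave differently for locking.
    con = sqlite3.connect(os.path.join(tmp, "pragma_demo.idx"))
    con.execute("PRAGMA synchronous=OFF")
    con.execute("PRAGMA locking_mode=EXCLUSIVE")
    # Querying a PRAGMA without a value returns the current setting.
    synchronous = con.execute("PRAGMA synchronous").fetchone()[0]
    locking_mode = con.execute("PRAGMA locking_mode").fetchone()[0]
    con.close()

print(synchronous, locking_mode)
```

Logging these two values right after the database is created on both machines would rule out the settings being silently ignored on one of them.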
Thanks, I will test the new build next week, but in the meantime:
- do you trigger the error with all inputs, of type fastq (so we can rule out the file)
- does it trigger with another kind of input type (so we can say it's generalizable to the base class)
I've only tried to index this fastq and its corresponding r2 file (along with various subsets). Another variation I tried was replacing the read IDs with integers (i.e. replaced those K00171:456:HKGMHBBXX:5:2228:31517:49247 strings with 1, 2, 3 .... 600000000-odd) but that didn't help.
- you mentioned running with snakemake - could you comment on that? That might introduce multiprocessing or something that would look like writing to the same file.
In general I am running this workflow with snakemake, but this step is single threaded. Anyhow, I have been launching the test containers directly from the shell, and snakemake isn't inside the container.
Thanks @TomHarrop, next week is good! I'll be around before that if you get bored and wind up doing it :)
Have a great weekend! :avocado:
Right, below is the message from the new container (I snipped some paths and the Inserting batch of 100 offsets messages).
I will also try some other files and file types and report.
nohup: ignoring input
DEBUG:root:sys.version
DEBUG:root:3.7.3 (default, May 8 2019, 05:28:42)
[GCC 6.3.0 20170516]
DEBUG:root:sqlite3.version
DEBUG:root:2.6.0
DEBUG:root:platform.python_implementation()
DEBUG:root:CPython
DEBUG:root:platform.platform()
DEBUG:root:Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-debian-9.9
DEBUG:root:Bio.__version__
DEBUG:root:1.73
DEBUG:root:os.environ
DEBUG:root:environ({'PYTHON_PIP_VERSION': '19.1.1', 'LD_LIBRARY_PATH': '/.singularity.d/libs', 'HOME': '/path/to/tmp/tmp.ZbOjbkDZrk', 'GPG_KEY': '0D96DF4D4110E5C43FBFB17F2D347EA6AA65421D', 'PS1': 'Singularity> ', 'PYTHONNOUSERSITE': '', 'TMPDIR': '/path/to/tmp', 'TERM': 'xterm-256color', 'PATH': '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LANG': 'C', 'SINGULARITY_APPNAME': '', 'PYTHON_VERSION': '3.7.3', 'SINGULARITY_CONTAINER': '/path/to/tmp/biopython-debug.sif', 'PWD': '/path/to/tmp', 'SINGULARITY_NAME': 'biopython-debug.sif', 'TZ': 'Pacific/Auckland', 'LC_CTYPE': 'C.UTF-8'})
START of call to index_db in biopython.Bio.SeqIO
index_filename is /path/to/tmp/tmpnrl4ctit/r1.idx
filenames are /r1.fq
format is fastq
alphabet is None
key_function is None
filenames variable was a basestring, putting into list
imported _FormatToRandomAccess, a dict with lookups being the formats
dict_keys(['ace', 'embl', 'fasta', 'fastq', 'fastq-sanger', 'fastq-solexa', 'fastq-illumina', 'genbank', 'gb', 'ig', 'imgt', 'phd', 'pir', 'sff', 'sff-trim', 'swiss', 'tab', 'qual', 'uniprot-xml'])
imported Bio.File._SQLiteManySeqFilesDict, Read only dictionary interface to many sequential record files.
repr of inputs (repr) is SeqIO.index_db('/path/to/tmp/tmpnrl4ctit/r1.idx', filenames=['/r1.fq'], format='fastq', alphabet=None, key_function=None)
END of of index_db, will return call to _SQLiteManySeqFilesDict
START of call to Bio.File._SQLiteManySeqFilesDict
index_filename is /path/to/tmp/tmpnrl4ctit/r1.idx
proxy_factory is <function index_db.<locals>.proxy_factory at 0x7fb0a674e1e0>
filenames are ['/r1.fq']
key_function is None
format is fastq
repr is SeqIO.index_db('/path/to/tmp/tmpnrl4ctit/r1.idx', filenames=['/r1.fq'], format='fastq', alphabet=None, key_function=None)
max_open is 10 (could this be an issue?)
filenames was likely generator, turning to list
Relative path of index_filename is /path/to/tmp/tmpnrl4ctit
index_filename needs to be built (no file)
START of call to Bio.File._build_index
START of proxy_factory, format:fastq, filename:None
filename is provided, returning _FormatToRandomAccess
connecting to sqlite database /path/to/tmp/tmpnrl4ctit/r1.idx
creating database with the following:
PRAGMA synchronous=ON
PRAGMA locking_mode=EXCLUSIVE
This is where the unique indexing was commented out, adding it back in.
CREATE TABLE offset_data (key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER, length INTEGER);
About to issue the following commands:
CREATE TABLE meta_data (key TEXT, value TEXT);
INSERT INTO meta_data (key, value) VALUES (?,?); ("count", -1)
INSERT INTO meta_data (key, value) VALUES (?,?); ("format", fastq)
INSERT INTO meta_data (key, value) VALUES (?,?); ("filenames_relative_to_index", "True")
CREATE TABLE file_data (file_number INTEGER, name TEXT);
CREATE TABLE offset_data (key TEXT, file_number INTEGER, offset INTEGER, length INTEGER);
Starting enumeration through filenames ['/r1.fq']
storing '/r1.fq' as ['/path/to/tmp/tmpnrl4ctit'] '/r1.fq'
INSERT INTO file_data (file_number, name) VALUES (?,?); (0, xxx)
START of proxy_factory, format:fastq, filename:/r1.fq
filename isn't providing, returning _FormatToRandomAccess[fastq]
INIT of base class SeqFileRandomAccess
filename is /r1.fq
format is fastq
alphabet is None
Opening /r1.fq for random access
START to Bio.File._open_for_random_access
filename is /r1.fq
magic is b'@M'
seeked to start of file
END to Bio.File._open_for_random_access, returning handle <_io.BufferedReader name='/r1.fq'>
Loading parsing class/function once to avoid dict lookup in each __getitem__ call
Loaded <function FastqPhredIterator at 0x7fb0a0f9f378>
Alphabet is None, note there is note here that code is 'nasty'
random_access_proxy is <Bio.SeqIO._index.FastqRandomAccess object at 0x7fb0a0eaea20>
key_function is None
IMPORTANT: We are about to iterate through offsets
START of of FastqRandomAccess __iter__
Seeked to start of handle <_io.BufferedReader name='/r1.fq'>
Inserting batch of 100 offsets, MG00HS20:1017:CAK56ANXX:6:1101:12714:2153 ... MG00HS20:1017:CAK56ANXX:6:1101:1925:2736
# MANY MORE LINES
Inserting batch of 100 offsets, K00171:456:HKGMHBBXX:5:2228:18862:48808 ... K00171:456:HKGMHBBXX:5:2228:29934:49212
EOF
END of of FastqRandomAccess __iter__
Inserting batch of 1 offsets, K00171:456:HKGMHBBXX:5:2228:31517:49247 ... K00171:456:HKGMHBBXX:5:2228:31517:49247
length of random_access_proxies is less than max_open 10
Previously indexed entries here, now should skip, finding it exists.
CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key);
We triggered the error, closing connection and proxies.
ERROR:root:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 877, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 1073, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 609, in __init__
self._build_index()
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 887, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 877, in _build_index
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
sqlite3.IntegrityError: UNIQUE constraint failed: offset_data.key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/index_reads.py", line 41, in <module>
raise e
File "/index_reads.py", line 38, in <module>
'fastq')
File "/usr/local/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 1073, in index_db
key_function, repr)
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 609, in __init__
self._build_index()
File "/usr/local/lib/python3.7/site-packages/Bio/File.py", line 887, in _build_index
raise ValueError("Duplicate key? %s" % err)
ValueError: Duplicate key? UNIQUE constraint failed: offset_data.key
Here's how it looks on my desktop:
DEBUG:root:sys.version
DEBUG:root:3.7.3 (default, May 8 2019, 05:28:42)
[GCC 6.3.0 20170516]
DEBUG:root:sqlite3.version
DEBUG:root:2.6.0
DEBUG:root:platform.python_implementation()
DEBUG:root:CPython
DEBUG:root:platform.platform()
DEBUG:root:Linux-4.18.0-18-generic-x86_64-with-debian-9.9
DEBUG:root:Bio.__version__
DEBUG:root:1.73
DEBUG:root:os.environ
DEBUG:root:environ({'PYTHON_PIP_VERSION': '19.1.1', 'LD_LIBRARY_PATH': '/.singularity.d/libs', 'HOME': '/tmp/tmp.vl6aT6Jpd3', 'GPG_KEY': '0D96DF4D4110E5C43FBFB17F2D347EA6AA65421D', 'PS1': 'Singularity> ', 'PYTHONNOUSERSITE': '', 'TMPDIR': '/tmp', 'TERM': 'xterm-256color', 'PATH': '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LANG': 'C', 'SINGULARITY_APPNAME': '', 'PYTHON_VERSION': '3.7.3', 'SINGULARITY_CONTAINER': '/path/to/img/biopython-index-test/biopython-debug.sif', 'PWD': '/path/to/img/biopython-index-test', 'SINGULARITY_NAME': 'biopython-debug.sif', 'LC_CTYPE': 'C.UTF-8'})
START of call to index_db in biopython.Bio.SeqIO
index_filename is /path/to/img/biopython-index-test/tmp6851ycc7/r1.idx
filenames are /r1.fq
format is fastq
alphabet is None
key_function is None
filenames variable was a basestring, putting into list
imported _FormatToRandomAccess, a dict with lookups being the formats
dict_keys(['ace', 'embl', 'fasta', 'fastq', 'fastq-sanger', 'fastq-solexa', 'fastq-illumina', 'genbank', 'gb', 'ig', 'imgt', 'phd', 'pir', 'sff', 'sff-trim', 'swiss', 'tab', 'qual', 'uniprot-xml'])
imported Bio.File._SQLiteManySeqFilesDict, Read only dictionary interface to many sequential record files.
repr of inputs (repr) is SeqIO.index_db('/path/to/img/biopython-index-test/tmp6851ycc7/r1.idx', filenames=['/r1.fq'], format='fastq', alphabet=None, key_function=None)
END of of index_db, will return call to _SQLiteManySeqFilesDict
START of call to Bio.File._SQLiteManySeqFilesDict
index_filename is /path/to/img/biopython-index-test/tmp6851ycc7/r1.idx
proxy_factory is <function index_db.<locals>.proxy_factory at 0x7fcb4d57b378>
filenames are ['/r1.fq']
key_function is None
format is fastq
repr is SeqIO.index_db('/path/to/img/biopython-index-test/tmp6851ycc7/r1.idx', filenames=['/r1.fq'], format='fastq', alphabet=None, key_function=None)
max_open is 10 (could this be an issue?)
filenames was likely generator, turning to list
Relative path of index_filename is /path/to/img/biopython-index-test/tmp6851ycc7
index_filename needs to be built (no file)
START of call to Bio.File._build_index
START of proxy_factory, format:fastq, filename:None
filename is provided, returning _FormatToRandomAccess
connecting to sqlite database /path/to/img/biopython-index-test/tmp6851ycc7/r1.idx
creating database with the following:
PRAGMA synchronous=ON
PRAGMA locking_mode=EXCLUSIVE
This is where the unique indexing was commented out, adding it back in.
CREATE TABLE offset_data (key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER, length INTEGER);
About to issue the following commands:
CREATE TABLE meta_data (key TEXT, value TEXT);
INSERT INTO meta_data (key, value) VALUES (?,?); ("count", -1)
INSERT INTO meta_data (key, value) VALUES (?,?); ("format", fastq)
INSERT INTO meta_data (key, value) VALUES (?,?); ("filenames_relative_to_index", "True")
CREATE TABLE file_data (file_number INTEGER, name TEXT);
CREATE TABLE offset_data (key TEXT, file_number INTEGER, offset INTEGER, length INTEGER);
Starting enumeration through filenames ['/r1.fq']
storing '/r1.fq' as ['/path/to/img/biopython-index-test/tmp6851ycc7'] '/r1.fq'
INSERT INTO file_data (file_number, name) VALUES (?,?); (0, xxx)
START of proxy_factory, format:fastq, filename:/r1.fq
filename isn't providing, returning _FormatToRandomAccess[fastq]
INIT of base class SeqFileRandomAccess
filename is /r1.fq
format is fastq
alphabet is None
Opening /r1.fq for random access
START to Bio.File._open_for_random_access
filename is /r1.fq
magic is b'@M'
seeked to start of file
END to Bio.File._open_for_random_access, returning handle <_io.BufferedReader name='/r1.fq'>
Loading parsing class/function once to avoid dict lookup in each __getitem__ call
Loaded <function FastqPhredIterator at 0x7fcb4d664378>
Alphabet is None, note there is note here that code is 'nasty'
random_access_proxy is <Bio.SeqIO._index.FastqRandomAccess object at 0x7fcb4d5749b0>
key_function is None
IMPORTANT: We are about to iterate through offsets
START of of FastqRandomAccess __iter__
Seeked to start of handle <_io.BufferedReader name='/r1.fq'>
Inserting batch of 100 offsets, MG00HS20:1017:CAK56ANXX:6:1101:12714:2153 ... MG00HS20:1017:CAK56ANXX:6:1101:1925:2736
# MANY MORE LINES
Inserting batch of 100 offsets, K00171:456:HKGMHBBXX:5:2228:18862:48808 ... K00171:456:HKGMHBBXX:5:2228:29934:49212
EOF
END of of FastqRandomAccess __iter__
Inserting batch of 1 offsets, K00171:456:HKGMHBBXX:5:2228:31517:49247 ... K00171:456:HKGMHBBXX:5:2228:31517:49247
length of random_access_proxies is less than max_open 10
Previously indexed entries here, now should skip, finding it exists.
CREATE UNIQUE INDEX IF NOT EXISTS key_index ON offset_data(key);
END of call to Bio.File._build_index
END of call to Bio.File._SQLiteManySeqFilesDict
okay, we can determine that the pragma variable isn't causing the issue. Let's try the other file types and see if that helps. I don't have time tonight to take a look, but that should give some time to test on the other files!
Here's a diff for the two logs above.
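A diff of this kind can be produced with Python's difflib after masking values that are expected to differ between runs. The sketch below masks object addresses and tempfile-style directory names; the function names and the choice of what to mask are assumptions, and the example lines are copied from the two logs above.

```python
import difflib
import re


def normalise(line):
    """Mask values that differ between runs without being meaningful."""
    line = re.sub(r"0x[0-9a-f]+", "0xADDR", line)              # object addresses
    line = re.sub(r"tmp[0-9a-z_]{8}\b", "tmpXXXXXXXX", line)   # tempfile names
    return line


def diff_logs(lines_a, lines_b, name_a="rhel.log", name_b="desktop.log"):
    """Return a unified diff of two logs after normalisation."""
    a = [normalise(line) for line in lines_a]
    b = [normalise(line) for line in lines_b]
    return list(difflib.unified_diff(a, b, name_a, name_b, lineterm=""))


rhel = [
    "proxy_factory is <function index_db.<locals>.proxy_factory at 0x7fb0a674e1e0>",
    "DEBUG:root:Linux-3.10.0-862.11.6.el7.x86_64-x86_64-with-debian-9.9",
    "We triggered the error, closing connection and proxies.",
]
desktop = [
    "proxy_factory is <function index_db.<locals>.proxy_factory at 0x7fcb4d57b378>",
    "DEBUG:root:Linux-4.18.0-18-generic-x86_64-with-debian-9.9",
    "END of call to Bio.File._build_index",
]
result = diff_logs(rhel, desktop)
print("\n".join(result))
```

With the addresses masked, the proxy_factory lines compare equal and only the kernel string and the final outcome show up as differences.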