FTP list fails with large number of file #57

horkko · 2016-08-12T10:24:26Z

Hi,

I'm facing a problem with a bank that downloads a lot of files.
I'm trying to get files from Genbank WGS (ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs).
This directory contains around 84,000 files. When I run biomaj, I always get this error:

[ftp.py:list:277] Could not get errcode:(56, 'FTP response reading failed')

It seems to mean that the FTP response takes longer than expected when retrieving the list of files.
I've tried setting some options such as FTP_RESPONSE_TIME, but without success.
So my question is: do you have any clue how to avoid this problem?
The problem is similar in Firefox: listing the wgs directory ends with a blank page.
However, with ncftp the dir command succeeds, but we need to wait around a minute to get the file list.

Thanks

Emmanuel

osallou · 2016-08-12T11:06:50Z

Hmm, I have not faced this issue. I would also have looked at a timeout issue, but if that does not solve the problem, I don't know.
Maybe it is a different timeout: not a response time but a connect time or something like that. I will have a look next week.

osallou · 2016-08-12T11:09:07Z

Could you try with curl directly, with the option --trace trace.txt? I saw the same issue reported on the internet, but for sftp servers, not ftp.

horkko · 2016-08-12T12:29:34Z

Yes, me too. But the NCBI site is not sftp :(
Here is the command:
curl --trace-ascii trace.txt --use-ascii ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/

Here is the trace.txt output:

== Info: About to connect() to ftp.ncbi.nlm.nih.gov port 21 (#0)
== Info:   Trying 130.14.250.13... == Info: connected
== Info: Connected to ftp.ncbi.nlm.nih.gov (130.14.250.13) port 21 (#0)
<= Recv header, 6 bytes (0x6)
0000: 220-
<= Recv header, 18 bytes (0x12)
0000:  Warning Notice!
<= Recv header, 3 bytes (0x3)
0000:  
<= Recv header, 77 bytes (0x4d)
0000:  You are accessing a U.S. Government information system which in
0040: cludes this
<= Recv header, 66 bytes (0x42)
0000:  computer, network, and all attached devices. This system is for
<= Recv header, 80 bytes (0x50)
0000:  Government-authorized use only. Unauthorized use of this system
0040:  may result in
<= Recv header, 77 bytes (0x4d)
0000:  disciplinary action and civil and criminal penalties. System us
0040: ers have no
<= Recv header, 80 bytes (0x50)
0000:  expectation of privacy regarding any communications or data pro
0040: cessed by this
<= Recv header, 72 bytes (0x48)
0000:  system. At any time, the government may monitor, record, or sei
0040: ze any
<= Recv header, 73 bytes (0x49)
0000:  communication or data transiting or stored on this information 
0040: system.
<= Recv header, 6 bytes (0x6)
0000:  ---
<= Recv header, 90 bytes (0x5a)
0000:  Welcome to the NCBI ftp server! The anonymous access URL is ftp
0040: ://ftp.ncbi.nlm.nih.gov/
<= Recv header, 3 bytes (0x3)
0000:  
<= Recv header, 102 bytes (0x66)
0000:  Public data may be downloaded by logging in as "anonymous" usin
0040: g your E-mail address as a password.
<= Recv header, 3 bytes (0x3)
0000:  
<= Recv header, 85 bytes (0x55)
0000:  Please see ftp://ftp.ncbi.nlm.nih.gov/README.ftp for hints on l
0040: arge file transfers
<= Recv header, 23 bytes (0x17)
0000: 220 FTP Server ready.
=> Send header, 16 bytes (0x10)
0000: USER anonymous
<= Recv header, 75 bytes (0x4b)
0000: 331 Anonymous login ok, send your complete email address as your
0040:  password
=> Send header, 22 bytes (0x16)
0000: PASS [email protected]
<= Recv header, 50 bytes (0x32)
0000: 230 Anonymous access granted, restrictions apply
=> Send header, 5 bytes (0x5)
0000: PWD
<= Recv header, 34 bytes (0x22)
0000: 257 "/" is the current directory
== Info: Entry path is '/'
=> Send header, 13 bytes (0xd)
0000: CWD genbank
<= Recv header, 28 bytes (0x1c)
0000: 250 CWD command successful
=> Send header, 9 bytes (0x9)
0000: CWD wgs
<= Recv header, 28 bytes (0x1c)
0000: 250 CWD command successful
=> Send header, 6 bytes (0x6)
0000: EPSV
== Info: Connect data stream passively
<= Recv header, 48 bytes (0x30)
0000: 229 Entering Extended Passive Mode (|||50241|)
== Info:   Trying 130.14.250.13... == Info: connected
== Info: Connecting to 130.14.250.13 (130.14.250.13) port 50241
=> Send header, 8 bytes (0x8)
0000: TYPE A
<= Recv header, 19 bytes (0x13)
0000: 200 Type set to A
=> Send header, 6 bytes (0x6)
0000: LIST
<= Recv header, 54 bytes (0x36)
0000: 150 Opening ASCII mode data connection for file list
== Info: Maxdownload = -1
<= Recv data, 0 bytes (0x0)
== Info: Remembering we are in dir "genbank/wgs/"
== Info: FTP response reading failed
== Info: Connection #0 to host ftp.ncbi.nlm.nih.gov left intact
=> Send header, 6 bytes (0x6)
0000: QUIT
== Info: FTP response reading failed
== Info: Closing connection #0

osallou · 2016-08-12T13:33:27Z

I think it expects to start receiving something within X seconds and cancels
if the timeout is reached.

horkko · 2016-08-12T13:39:33Z

Yes, that's what I suspected, but I could not find any documentation on this for pycurl.
There is an FTP_RESPONSE_TIMEOUT option in the curl library, but I did not find it documented for pycurl.
I tried to set it in ftp.list as self.crl.setopt(pycurl.FTP_RESPONSE_TIMEOUT, 300). No warning from biomaj, but the listing still did not succeed.

By the way, I've tried with pdb, which downloads more than 100,000 files. The listing does not fail!!
So I've found a small difference between the two output logs:

Genbank

> CWD wgs
< 250 CWD command successful
> EPSV
* Connect data stream passively
< 229 Entering Extended Passive Mode (|||50205|)
*   Trying 130.14.250.12... * connected
* Connecting to 130.14.250.12 (130.14.250.12) port 50205
> TYPE A
< 200 Type set to A
> LIST
< 150 Opening ASCII mode data connection for file list

PDB

< 250K. Current directory is /pub/pdb/derived_data
> PASV
* Connect data stream passively
< 227 Entering Passive Mode (165,230,17,202,197,113)
*   Trying 165.230.17.202... * connected
* Connecting to 165.230.17.202 (165.230.17.202) port 50545
> LIST
< 150 Accepted data connection

The only difference I see is the mode: PASV for PDB and EPSV for Genbank. Could it be a clue?

EDIT: I've tried disabling EPSV mode for ftp with self.crl.setopt(pycurl.FTP_USE_EPSV, 0), but it has no effect :(
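
For reference, the 227 and 229 replies quoted above both just tell the client which port to open for the data connection, in two different encodings. A minimal sketch of how each reply decodes, assuming the usual `|` delimiter in the 229 reply; the helper names parse_pasv and parse_epsv are hypothetical and are not part of biomaj or pycurl:

```python
import re


def parse_pasv(reply):
    """Decode a 227 reply into (host, port).

    The six numbers are h1,h2,h3,h4,p1,p2 and the port is p1 * 256 + p2.
    """
    m = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if m is None:
        raise ValueError("not a PASV reply: %r" % reply)
    h1, h2, h3, h4, p1, p2 = (int(x) for x in m.groups())
    return "%d.%d.%d.%d" % (h1, h2, h3, h4), p1 * 256 + p2


def parse_epsv(reply, control_host):
    """Decode a 229 reply into (host, port).

    EPSV only carries a port; the data connection goes to the same host
    as the control connection, so the caller supplies it.
    """
    m = re.search(r"\(\|\|\|(\d+)\|\)", reply)
    if m is None:
        raise ValueError("not an EPSV reply: %r" % reply)
    return control_host, int(m.group(1))


# Replies copied from the two logs above.
print(parse_pasv("227 Entering Passive Mode (165,230,17,202,197,113)"))
print(parse_epsv("229 Entering Extended Passive Mode (|||50205|)", "130.14.250.12"))
```

Both calls yield the host and port that curl reports connecting to in the logs, which fits the observation that switching the passive mode variant changes nothing.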

osallou · 2016-08-12T14:22:45Z

Passive vs active should not be the issue. That usually causes problems when going
through firewalls.
Pycurl provides the same options as libcurl.
The issue is the time to get the start of the list. For PDB it is almost immediate.
It seems their server has trouble returning the listing (why does it take so long?).

osallou · 2016-08-12T14:26:19Z

Did you try setting CURLOPT_TIMEOUT, just like for the download step (and setting the parameter in the config)?

horkko · 2016-08-12T14:33:35Z

CURLOPT_TIMEOUT is already set in ftp.list

        self.crl.setopt(pycurl.CONNECTTIMEOUT, 300)
        # Download should not take more than 5 minutes
        self.crl.setopt(pycurl.TIMEOUT, self.timeout)
        self.crl.setopt(pycurl.NOSIGNAL, 1)

where self.timeout is set from workflow.py:

        timeout_download = self.bank.config.get('timeout.download')
        if timeout_download is not None and timeout_download:
            downloader.timeout = int(timeout_download)

Even if I increase this value, it has no effect :(

horkko · 2016-08-16T12:53:18Z

Hi,

Maybe a clue to fix this problem: using the curl option CURLOPT_DIRLISTONLY partially solves it.

...
< 200 Type set to A
> NLST
< 150 Opening ASCII mode data connection for file list
* Maxdownload = -1
* Remembering we are in dir "genbank/wgs/"
< 226 Transfer complete
* Connection #0 to host ftp.ncbi.nlm.nih.gov left intact
...

At least the directory listing is available. However, we fail later in the workflow, because this curl option only lists the directory content; it does not retrieve metadata such as permissions, date, size, etc.
So for a bank with a release.file set it should not be a problem, but for a bank that bases its release number on the date of the last updated file, we end up with this error:

2016-08-16 14:47:55,917 ERROR [root][MainThread] [workflow.py:start:135] Workflow:downloadException:'year'

which comes from building the release based on the last updated files :(
Is this new option a good start for solving the problem?
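
To illustrate why a names-only listing breaks a date-based release, here is a minimal sketch that parses one Unix-style LIST line and one NLST line. The sample lines and the parse_entry helper are made up for illustration; this is not biomaj's actual parser, and the link to the 'year' error above is only a plausible guess:

```python
import re

# Unix-style LIST line: perms, link count, owner, group, size, month, day,
# then a year (or HH:MM for recent files), then the file name.
LIST_LINE = re.compile(
    r"^(?P<perms>[-dl][-rwxsStT]{9})\s+\d+\s+\S+\s+\S+\s+"
    r"(?P<size>\d+)\s+(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<year>\d{4}|\d{1,2}:\d{2})\s+(?P<name>.+)$"
)


def parse_entry(line):
    """Return a dict of fields for a LIST line, or just the name for NLST."""
    m = LIST_LINE.match(line)
    if m is None:
        # NLST output carries the file name only: no size, no date.
        return {"name": line.strip()}
    return m.groupdict()


# Made-up sample lines, one per listing style.
from_list = parse_entry("-r--r--r--   1 ftp      anonymous     1234 Aug 12  2016 wgs.AAAA.1.gbff.gz")
from_nlst = parse_entry("wgs.AAAA.1.gbff.gz")

print(sorted(from_list))    # includes 'year', 'month', 'day', 'size'
print(sorted(from_nlst))    # only 'name'
print("year" in from_nlst)  # any code doing from_nlst["year"] raises KeyError
```

With NLST there is simply no date field to read, so any step that looks up the year of the most recent file has nothing to work with.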

osallou · 2016-08-16T13:06:55Z

We need all the metadata, so it is not good :-(

horkko · 2016-08-16T13:08:30Z

Yes, I know, unless we can combine such a bank (with a huge file list) with a release file number.

osallou · 2016-08-16T13:11:42Z

This is a workaround for a specific bank, and it is not even certain that it will work 100%.

horkko · 2016-08-16T13:12:12Z

yeah you're right :(

osallou · 2016-08-16T13:13:24Z

Could you share the bank ini file?

horkko · 2016-08-16T13:22:11Z

Here is the info for Genbank WGS:

protocol=ftp
server=ftp.ncbi.nlm.nih.gov
remote.dir=/genbank/wgs/
remote.files=^wgs\.\w{4}[\.\d]*\.g(np|bff)\.gz$
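
As a quick sanity check of the remote.files pattern above, here is a small sketch that applies it to a few file names. The names in the list are made up for illustration and are not taken from the real directory; the helper select is hypothetical:

```python
import re

# Pattern copied verbatim from the remote.files setting above.
REMOTE_FILES = re.compile(r"^wgs\.\w{4}[\.\d]*\.g(np|bff)\.gz$")


def select(names):
    """Keep only the names that the remote.files pattern would match."""
    return [name for name in names if REMOTE_FILES.match(name)]


# Hypothetical directory entries.
candidates = [
    "wgs.AAAA.1.gbff.gz",    # flat-file release, matches
    "wgs.AAAA.gnp.gz",       # protein file without a version part, matches
    "wgs.AAAA.1.fsa_nt.gz",  # other extension, rejected
    "README",                # rejected
]

print(select(candidates))
```

The filter only applies once the listing has been retrieved, so a restrictive pattern does not shorten the LIST call itself.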

osallou · 2016-08-16T14:53:01Z

I am trying the TCP_KEEPALIVE option, which needs pycurl/curl version >= 7.25.0.
I reached the default timeout (5 minutes); I will try a higher value to see if I can get something.

horkko · 2016-08-16T14:57:16Z

OK. For info, I updated my pycurl from 7.19 to 7.43 today, but it did not change anything compared to the original problem.
Let me know about this new option.

osallou · 2016-08-16T15:09:08Z

Not better, but error (56, 'response reading failed') occurs after anywhere from 1 minute to more (it occurred at 5 minutes). It varies, so it depends on the remote server.

horkko · 2016-08-16T15:11:44Z

For info, using ncftp from the command line does not report error 56, and the first output of the directory listing appears after about a minute.

osallou · 2016-08-16T15:23:10Z

Does ncftp report all metadata?

horkko · 2016-08-16T15:32:27Z

I don't think so; I don't remember, actually.

osallou · 2016-08-16T15:33:49Z

maybe it acts like CURLOPT_DIRLISTONLY

horkko · 2016-08-16T15:42:03Z

probably :(

horkko · 2016-08-19T14:27:44Z

Hi Olivier,

Back on the problem. We've found its source. It is not related to pycurl or even libcurl itself. The directory listing works well when we only ask for the names of the files in the remote directory (FTP command NLST instead of LIST).
As soon as we ask for the related metadata (time, size, permissions), the server takes longer to build the list
than a certain delay, after which the remote FTP server closes the connection with a FIN-ACK on both the command channel and the data channel.
That's why we get the error (56, 'FTP response reading failed').
Hope that helps.
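
One way to picture the NLST/LIST difference is a listing that uses NLST for the names and then asks for metadata one file at a time. This is only a sketch of that idea, not something biomaj does: it assumes the server supports the MDTM and SIZE commands, the function name names_with_metadata is hypothetical, and with tens of thousands of files the per-file round trips may be too slow to be practical.

```python
from datetime import datetime


def names_with_metadata(ftp, keep):
    """List names with NLST, then fetch size and date per file.

    ftp:  a connected, logged-in ftplib.FTP object (or anything with the
          same voidcmd/nlst/sendcmd/size methods), already in the target
          directory.
    keep: predicate on the file name, so metadata is only requested for
          the files of interest.
    """
    # SIZE is only reliable in binary mode on many servers.
    ftp.voidcmd("TYPE I")
    entries = []
    for name in ftp.nlst():
        if not keep(name):
            continue
        # MDTM replies look like "213 YYYYMMDDHHMMSS".
        reply = ftp.sendcmd("MDTM " + name)
        stamp = datetime.strptime(reply[4:18], "%Y%m%d%H%M%S")
        entries.append({"name": name, "size": ftp.size(name), "date": stamp})
    return entries


# Intended use (not run here, since it needs network access):
#   from ftplib import FTP
#   ftp = FTP("ftp.ncbi.nlm.nih.gov", timeout=300)
#   ftp.login()
#   ftp.cwd("/genbank/wgs/")
#   entries = names_with_metadata(ftp, lambda n: n.endswith(".gbff.gz"))
```

Filtering on the name before asking for metadata keeps the number of extra commands proportional to the files actually wanted rather than to the whole directory.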

Emmanuel

osallou · 2016-08-19T14:39:04Z

Nice analysis. Maybe you should contact the upstream FTP maintainers to raise
the issue and get it solved.
Biomaj needs the metadata and, beyond this, it is an issue for any user with
a browser.

Le ven. 19 août 2016 16:27, Emmanuel Quevillon [email protected] a
écrit :

Hi Olivier,

Back on the problem. We've found the source of the problem. It is not
related to pycurl or even libcurl itself. The directory listing works
well when we only ask for the name of the file(s) in the remote directory
(ftp command NLST instead of LIST).
As soon as we ask for related metadata (time, size, permissions), then the
time taken from
the server to build the list is greater than a certain amount of time from
when the remote ftp server close the connection with a FIN-ACK on the
command channel as well as the data channel.
That's why we get an error (56, 'FTP response reading failed')
Hope that help.

Emmanuel

—
You are receiving this because you commented.

Reply to this email directly, view it on GitHub
#57 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA-gYvNbZY7IfvymMVN4BBFBkQLq7shEks5qhb1ggaJpZM4Ji9ai
.

horkko · 2016-08-19T14:45:00Z

Thanks :)
I have already contacted NCBI support about this; I am waiting for their
reply.
And yes, you are right, the directory listing is not possible with a web
browser :(

osallou added the bug label Aug 12, 2016

osallou added question and removed bug labels Aug 19, 2016
