Azure-sdk-for-python: list_blobs only lists 5000 elements

Created on 19 Jun 2015  ·  3 Comments  ·  Source: Azure/azure-sdk-for-python

Hi all,

I am trying to download a "folder" from a blob container while keeping the original folder tree structure. This "folder" contains millions of files.
To do this, I first run list_blobs to obtain the list of blobs and then download each blob using get_blob_to_path.

blobs = blob_service.list_blobs('blob_container', 'data/projects/folder')
for blob in blobs:
    print(blob.name)

This loop only shows the first 5000 blobs inside data/projects/folder but, as I said, I have millions of files.

Any idea why this loop only shows the first 5000 elements?
Any other suggestions for downloading millions of files from a blob container?

Thanks in advance and best regards.

Service Attention Storage question

All 3 comments

You'll need to set a marker for containers with more than 5,000 entries, since each list_blobs call returns at most 5,000 results. Here is a code snippet I used:

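marker = None  # None starts the listing from the beginning
while True:
    # prefix and delimiter are optional filters; define them first or drop them
    results = blob_service.list_blobs('blob_container', marker=marker, prefix=prefix, delimiter=delimiter)
    # ... do stuff with results ...
    if results.next_marker:
        marker = results.next_marker  # more pages remain; continue from here
    else:
        break  # no marker returned, so this was the last page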

Basically you set the initial marker to None to start at the beginning, then loop until a result set does not return a pointer to a new marker.
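
For what it's worth, the same loop can be wrapped in a generator so the caller sees one flat stream of results. This is only a sketch: `Page`, `FakeBlobService`, and `iter_all_blobs` are hypothetical names made up for illustration, and the fake stands in for `blob_service` so the snippet runs without an Azure account. It assumes only what the snippet above shows, namely that the result of `list_blobs` is iterable and carries a `next_marker` attribute. The fake yields plain name strings, whereas the real service returns blob objects with a `.name`.

```python
class Page(list):
    """Hypothetical stand-in for one list_blobs result: a list plus next_marker."""

    def __init__(self, items, next_marker=None):
        super().__init__(items)
        self.next_marker = next_marker


class FakeBlobService:
    """Hypothetical in-memory service that returns names one page at a time."""

    def __init__(self, names, page_size=5000):
        self.names = sorted(names)
        self.page_size = page_size

    def list_blobs(self, container_name, prefix=None, marker=None, delimiter=None):
        # The delimiter is accepted but ignored by this fake.
        matching = [n for n in self.names if n.startswith(prefix or '')]
        start = int(marker) if marker else 0
        end = start + self.page_size
        # Only hand back a marker when more results remain.
        next_marker = str(end) if end < len(matching) else None
        return Page(matching[start:end], next_marker)


def iter_all_blobs(service, container_name, **kwargs):
    """Yield every item across all pages by following next_marker."""
    marker = None
    while True:
        results = service.list_blobs(container_name, marker=marker, **kwargs)
        for item in results:
            yield item
        if not results.next_marker:
            break
        marker = results.next_marker


service = FakeBlobService(['data/%05d.csv' % i for i in range(12000)])
names = list(iter_all_blobs(service, 'blob_container', prefix='data/'))
print(len(names))
```

With the real service you would pass `blob_service` instead of the fake and download each item inside the `for` loop, so the full list of millions of names never has to sit in memory at once.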

Hi,

Thanks a lot, it is working now, although I am not using "prefix" and "delimiter" since I am not sure what the purpose of those two parameters is.

Best regards,

Prefix and delimiter are utility parameters for narrowing down file lists. For example, let's say you have a directory structure like this:

.
├── data
│   ├── archive
│   │   ├── old_file1.csv
│   │   └── old_file2.csv
│   ├── file1.csv
│   ├── file2.csv
│   └── file3.csv
└── logs
    └── log.txt

If you set the prefix to 'data', you will get:

data/file1.csv
data/file2.csv
data/file3.csv
data/archive/old_file1.csv
data/archive/old_file2.csv

But if you also set the delimiter to '/' (and write the prefix with its trailing slash, 'data/'), you will only get:

data/file1.csv
data/file2.csv
data/file3.csv

In this example, I used prefix to target a specific directory, and a delimiter of '/' to specify that I only want the files in that directory (and not subdirectories and their contents below it).

There are many applications of these parameters, but this is what I use them for anyway.
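
To make the two parameters concrete, here is a small pure-Python imitation of the filtering, run against the example tree above. It does not call Azure at all. `list_names` is a hypothetical helper, and its roll-up rule reflects my understanding of how the service treats a delimiter: any name that still contains the delimiter after the prefix is collapsed into a single "directory" entry instead of being listed as a file.

```python
def list_names(names, prefix='', delimiter=None):
    """Return (blobs, prefixes) roughly the way a prefix/delimiter listing would."""
    blobs, prefixes = [], []
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # Collapse everything below the next delimiter into one entry.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in prefixes:
                prefixes.append(rolled)
        else:
            blobs.append(name)
    return blobs, prefixes


tree = [
    'data/archive/old_file1.csv',
    'data/archive/old_file2.csv',
    'data/file1.csv',
    'data/file2.csv',
    'data/file3.csv',
    'logs/log.txt',
]

# Prefix only: everything under data/, including the archive files.
print(list_names(tree, prefix='data/'))

# Prefix plus delimiter: only the files directly under data/.
print(list_names(tree, prefix='data/', delimiter='/'))
```

In the second call the three files come back as blobs, and `data/archive/` is reported once as a prefix entry rather than being expanded into its contents.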
