Tensor2tensor: Tracking bug for issues in Wikisum

Created on 30 Apr 2018  ·  18 Comments  ·  Source: tensorflow/tensor2tensor

Description

I currently cannot sign the Contributor License Agreement, so I will not open a pull request; sorry about this.

If you are unable to process any of these patches without a pull request, I will make one at the end of the week. For the sake of getting these notes out as soon as possible, this is all I can offer for now.

Notes on shell commands

(https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikisum#commands-to-generate-wikisumweb)

  • The command "python -m tensor2tensor.data_generators.wikisum.parallel_launch" must be run without the '.py' suffix

  • --command_prefix can't find the scripts, needs to do a cd first (will break --code_dir feature, but this seems to have no effect on the other commands anyway, as they directly call the python module)
    --command_prefix="cd ~/.local/lib/python3.5/site-packages/tensor2tensor/data_generators/ ; python3 -m tensor2tensor.data_generators.wikisum.get_references_web --out_dir={{BUCKET}}/wiki_references --shard_id"

  • If "Cloud Storage JSON API" is not enabled yet in the gcloud account, it will fail silently on the cloud worker. Should be checked in parallel_launch.py first

  • Vocab-generation on Python3 throws an re exception

       File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/data_generators/wikisum/wikisum.py", line 361, in _normalize_text
        text = re.sub("[%s]" % re.escape(string.punctuation), r" \g<0> ", text)
      File "/usr/lib/python3.5/re.py", line 182, in sub
        return _compile(pattern, flags).sub(repl, string, count)
    TypeError: cannot use a string pattern on a bytes-like object
    
  • Vocab generation does not work on Windows, as it cannot access gs:// files, so it should be done in the cloud as well (explicitly using python2 to work around the re exception mentioned above)

python2 -m tensor2tensor.data_generators.wikisum.parallel_launch \
  --num_instances=1 \
  --cpu=4 --mem=16 \
  --name=wikisum-vocab-gen \
  --setup_command="pip install tensor2tensor tensorflow -U -q --user" \
  --command_prefix="python2 -m tensor2tensor.data_generators.wikisum.generate_vocab  --out_dir=$BUCKET/data  --refs_dir=$BUCKET/wiki_references ; echo Done shard"
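
For the re exception on Python 3 noted above, one possible workaround is to decode bytes before the substitution. This is only a minimal sketch, not necessarily the fix that went into T2T: `normalize_text` is a hypothetical stand-in that reproduces just the `re.sub` line from the traceback, and UTF-8 input is an assumption.

```python
import re
import string


def normalize_text(text):
    """Put spaces around punctuation, accepting bytes or str input."""
    # Assumption: byte input is UTF-8 encoded.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # Same substitution as in the traceback above.
    return re.sub("[%s]" % re.escape(string.punctuation), r" \g<0> ", text)


print(normalize_text(b"a,b"))
```

Decoding once at the boundary keeps the pattern a text pattern, which avoids mixing bytes and str inside `re.sub`.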

Changes for parallel_launch.py

Here is the full parallel_launch.py file https://pastebin.com/UUDjA9jL

Below are the changes I made:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L46 - Cleaning output

#FIX prevent a ton of future warnings from h5py
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L88 - Some changes to the shell commands

#FIX added username 'wsumuser', otherwise got an error as my windows-username is root
COPY_CODE = "gcloud compute scp --recurse {local_dir} wsumuser@{instance_name}:~/"
SSH = "gcloud compute ssh wsumuser@{instance_name} --command"
DEFAULT_ZONE = "gcloud config get-value compute/zone"
#FIX use screen's logging functionality to get rid of escaping problems with popen on Windows and to get more log coverage
SCREEN = "screen -L ~/logs-XXX.txt -dmS test bash -c \"{command}\""
#FIX no need for piping anymore, using screen -L
LOGS = "; gsutil cp ~/logs-XXX.txt {bucket}logs-{task_id}.txt"

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L105 - Find gcloud on windows

      #FIX try to find gcloud on windows
      try:
        return sp.check_call(args)
      except FileNotFoundError:
        if args[0] == "gcloud":
          args[0] = os.getenv('LOCALAPPDATA') + "/Google/Cloud SDK/google-cloud-sdk/bin/gcloud.cmd"
        return sp.check_call(args)

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L115 - Use python instead of netcat to wait for SSH

import socket
import time

def wait_for_ssh(ip):
  """Wait for SSH to be available at given IP address."""
  # FIX don't use netcat, but python to test for open SSH port
  for _ in range(12):
    s = socket.socket()
    s.settimeout(2)
    try:
      s.connect((ip, 22))
      return True
    except socket.error:  # covers timeouts and refused connections
      pass
    finally:
      s.close()
    time.sleep(10)
  return False

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L209 - Python3 compatibility

vm_names = list(zip(*vm_info))[0] if vm_info else [] #FIX: make a list from zip()-iterator first
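
To illustrate why the list() call matters: on Python 3, zip() returns an iterator that cannot be indexed. The sample data below is hypothetical, and the (name, ip) tuple layout of vm_info is an assumption.

```python
# Hypothetical sample data; the (name, ip) layout is an assumption.
vm_info = [("vm-0", "10.0.0.1"), ("vm-1", "10.0.0.2")]

# On Python 3, zip(*vm_info)[0] raises TypeError because zip() is lazy,
# so the iterator is materialized into a list first.
vm_names = list(zip(*vm_info))[0] if vm_info else []

# An equivalent form that avoids zip() altogether.
vm_names_alt = [name for name, _ in vm_info]

print(vm_names, vm_names_alt)
```

The comprehension yields a list where the zip() form yields a tuple, and it behaves the same on Python 2 and 3.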

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L216 - Log-Dir handling on windows

    #FIX the Windows path separator \ leads to an error on the cloud server afterwards
    log_dir = log_dir.replace("\\","/")

    #FIX on windows there is no gs:// , so give the user the opportunity to create the directories
    from tensorflow.python.framework.errors_impl import UnimplementedError
    try:
      tf.gfile.MakeDirs(log_dir)
    except UnimplementedError:
      input("[!] Use http://console.cloud.google.com and manually create '" + log_dir + "', then press return.................")

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L251 - Don't fail silently (the zip-exception, for example, was silenced)


    # FIX Don't fail silently
    except Exception as e:  # pylint: disable=broad-except
      failed.append(i)
      tf.logging.error("Failed to launch task %d due to exception %s", i, str(e))

Changes for cloud_tpu.py

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/cloud_tpu.py#L229 - Find gcloud on windows

def shell_output(cmd_, **kwargs):
  try:
    return text_encoder.to_unicode(sp.check_output(format_cmd(cmd_, **kwargs)))
  except FileNotFoundError:
    return text_encoder.to_unicode(sp.check_output(format_cmd_win(cmd_, **kwargs)))

def shell_run(cmd_, **kwargs):
  try:
    return sp.check_call(format_cmd(cmd_, **kwargs))
  except FileNotFoundError:
    return sp.check_call(format_cmd_win(cmd_, **kwargs))

def format_cmd_win(cmd_, **kwargs):
  ret = cmd_.format(**kwargs).strip().split()
  if ret[0] == "gcloud": ret[0] = os.getenv('LOCALAPPDATA') + "/Google/Cloud SDK/google-cloud-sdk/bin/gcloud.cmd"
  elif ret[0] == "gsutil": ret[0] = os.getenv('LOCALAPPDATA') + "/Google/Cloud SDK/google-cloud-sdk/bin/gsutil.cmd"
  return ret
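
As a possible alternative to catching FileNotFoundError, the executable could be resolved up front. The sketch below is an assumption-laden illustration, not part of T2T: `resolve_cloud_sdk_tool` is a hypothetical helper, the fallback directory is the one used in the snippets above, and `shutil.which` requires Python 3.3+.

```python
import os
import shutil


def resolve_cloud_sdk_tool(name, env=None):
    """Return a usable path for `name` ("gcloud" or "gsutil")."""
    # First try the regular PATH lookup.
    found = shutil.which(name)
    if found:
        return found
    env = os.environ if env is None else env
    local_app_data = env.get("LOCALAPPDATA")
    if local_app_data is None:
        # Not on Windows (or variable unset): leave the name unchanged.
        return name
    # Default per-user Cloud SDK location on Windows, as used above.
    return os.path.join(local_app_data, "Google", "Cloud SDK",
                        "google-cloud-sdk", "bin", name + ".cmd")


print(resolve_cloud_sdk_tool("gcloud"))
```

This also avoids the TypeError that `os.getenv('LOCALAPPDATA') + ...` would raise when the variable is not set.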

TensorFlow and tensor2tensor versions

tensor2tensor 1.6.1
tensorflow-gpu 1.7.0

bug

All 18 comments

Thanks very much @f-lng!

I'll address some of the bugs, though I'm not keen on adding Windows support currently. I'll have to think more about it since it adds a bit of complexity.

Were you able to generate the dataset in the end? Web or Commoncrawl or both? Were the time and cost estimates reasonably accurate?

@rsepassi Yes, I figured that Windows is not very high on the list, and that is no problem for me, but perhaps someone will find the notes useful anyway. Since most of the work happens in the cloud, it is not much of a problem to adjust things accordingly. This is no "run and forget" generator anyway.

I am getting the wikisumWeb dataset.

I did not create it yet, I had to wait until yesterday to get my quota increase, so right now I am only as far as downloading all the references.

However, I did test the other parts on a small subset already, and I am positive it will work out in the end.

Though, my version of wikisum is now a bit different from yours (due to changes for Windows, but also due to changes in how I want my dataset to be generated [using a bigram overlap per sentence instead of tf-idf per paragraph, for example]).


Additional notes:

Reliability of SSH commands

As soon as I try to spawn all 1k instances at once, even though I reduced the threads to 10, I get immediate and frequent SSH timeouts (they are retried, and mostly work out, but still).

What works better is a loop that calls parallel_launch.py with chunks of 10 using --instance_ids
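
The chunked launch could be sketched roughly as follows. `build_launch_commands` is a hypothetical helper, and it assumes --instance_ids takes a comma-separated list of integers; the remaining launch flags would be passed via `extra_args`. The commands are only built here, not executed.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids` with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_launch_commands(num_instances, chunk_size, extra_args=()):
    """Build one parallel_launch command line per chunk of instance ids."""
    commands = []
    for chunk in chunked(list(range(num_instances)), chunk_size):
        ids = ",".join(str(i) for i in chunk)
        commands.append(
            ["python", "-m",
             "tensor2tensor.data_generators.wikisum.parallel_launch",
             "--num_instances=%d" % num_instances]
            + list(extra_args)
            + ["--instance_ids=" + ids])
    return commands


for cmd in build_launch_commands(25, 10):
    print(" ".join(cmd))
```

Each command could then be run one after another, for example with `subprocess.check_call(cmd)`, so that only one chunk of SSH sessions is open at a time.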

Checks

Also somewhat related: I think some additional checks are needed after the download has completed, to verify that every shard has been downloaded correctly.
I will write one that parses the logs in the google bucket, but it will look quite messy (on windows I have to download all of the logs first, for example). Should I still post it here?

FYI: 24 instances failed for some reason in my run

Text normalization

I am still not sure whether it was a necessary decision because of the capabilities of the transformer, but I think normalization should not include lowercasing the text. As far as I know, lowercasing is not necessary on a 32k bilingual corpus, so I think it is not necessary here either. What do you think?

code_dir upload

I think it should be uploaded directly into "~/.local/lib/python{python_version}/site-packages/tensor2tensor/data_generators/", to make sure all imports use the new version and the "python -m" commands work the same way, too

Timeout

Reducing the timeout to 10 seconds saves 20% of the computational costs, with a negligible decrease in coverage.
(Note: I only tested coverage on the first shard for both timeouts.)

(Here https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/get_references_web_single_group.py#L209 )

Costs

The costs for downloading the web-references were ~500€ (including the timeout reduction, having most workers only run for 4 hours)

Compute Engine | Custom instance Core running in Americas | 658,50 Day | 429,86 €
Compute Engine | Custom instance Ram running in Americas | 659,19 Gibibyte-day | 57,67 €
Compute Engine | Network Internet Egress from Americas to Americas | 106,40 Gibibyte | 10,37 €
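
As a quick sanity check, the three billing lines above can be summed to confirm the quoted total of roughly 500€. The amounts are copied from those lines; the dictionary keys are shortened labels.

```python
# Amounts in EUR, copied from the billing lines above.
line_items_eur = {
    "Custom instance Core": 429.86,
    "Custom instance Ram": 57.67,
    "Network Internet Egress": 10.37,
}

total_eur = sum(line_items_eur.values())
print("Total: %.2f EUR" % total_eur)  # prints "Total: 497.90 EUR"
```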

You might add to the cost estimate that the 10TB of data will cost ~160€ per month to store

Time estimates for final generation step

@rsepassi

For me it does not take 8 hours but only 16 minutes to generate the final traindata.
I tested on the first 10 shards, here is the log of the first 3 : https://pastebin.com/K3YxjuXk

The generated data files are ~200 MB each, so the 200 GB for the final dataset seems about right.
They hold about 2.2k datapoints (ref->wiki) each, so in total I expect 2.2 million datapoints. Does that seem about right as well?
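
The extrapolation above amounts to the following arithmetic, assuming 1000 output shards of roughly equal size (the shard count is an assumption based on the 1k instances mentioned earlier).

```python
NUM_SHARDS = 1000          # assumption: one output file per shard
MB_PER_SHARD = 200         # approximate size observed per file
EXAMPLES_PER_SHARD = 2200  # approximate datapoints observed per file

total_gb = NUM_SHARDS * MB_PER_SHARD / 1000.0
total_examples = NUM_SHARDS * EXAMPLES_PER_SHARD

print("%.0f GB, %d examples" % (total_gb, total_examples))
```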

Deletion bug

If you start instances with --instance_ids, the deletion-on-failure routine mixes up the instance's index with its actual ID/name and tries to delete the wrong instances

Here you can see the problematic code https://pastebin.com/fuL2UZFw (see the comments)
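
The kind of mix-up described here can be illustrated with a small sketch. Everything in it is hypothetical (the helper, the "wikisum-%d" naming scheme, and the sample ids); it is not the code from parallel_launch.py, only a demonstration of index versus id.

```python
def names_to_delete(instance_ids, failed_positions, name_fmt="wikisum-%d"):
    """Map failed loop positions back to the names of launched instances."""
    return [name_fmt % instance_ids[pos] for pos in failed_positions]


# Hypothetical run: three instances requested via --instance_ids.
instance_ids = [17, 42, 256]
# The second launch (loop position 1) failed.
failed_positions = [1]

# Buggy variant: treats the loop position as the instance id.
buggy = ["wikisum-%d" % pos for pos in failed_positions]
# Fixed variant: looks the id up through the position.
fixed = names_to_delete(instance_ids, failed_positions)

print(buggy, fixed)
```

With a full launch the positions and ids coincide, which may be why the problem only shows up when --instance_ids is used.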

Thank you so much for the further reports @f-lng. Keep them coming!

For the SSH reliability: I think the best thing to do is to try a full launch of the 1k and then do a relaunch with --instance_ids for the ones where the SSH command failed.

Checks: I tried to make the scripts that are launched on the remote machines as robust as possible so if some of your jobs failed after launch, please report the error messages here.

Text normalization: it may not be strictly necessary, but it's what we did for the paper so we kept it the same.

code_dir upload: If you use the code_dir feature, you'll want to update the setup command. For example, you can make code_dir the entire local tensor2tensor directory and change the setup command to pip install -e . --user so that it makes all the from tensor2tensor... imports use the local copy.

Timeout: that's great. I'll reduce it to 10 and update the cost.

Time and costs: Yes, it took <30m for the final step for me as well. I must have based the initial estimate on a shard running on my local machine instead of on GCP so the data transfer was the bottleneck. Will update the time and cost estimates provided.

Deletion bug: Good find! Will update.

@rsepassi Sure thing, but I think these were all the bugs I will find, as I am currently generating the final dataset and it looks like it all works without problems now.
I will give you an update as soon as I have trained a model with it, though.

SSH reliability: Indeed that might have been a better approach, it takes a lot of time to spawn them in chunks of 10.

Checks: I think the problems came from timed-out SSH commands, so nothing you can fix I guess. As a very simple check, perhaps this snippet is enough:

rerun_shards = set()
ret = shell_output("gsutil ls {bucket}", bucket=BUCKET + "/wiki_references/process_0")

for i in range(1000):
    if ret.find(".gz-"+str(i).rjust(5, '0')+"-of-01000") == -1:
        rerun_shards.add(i)

print("INSTANCES_REDO =" , list(rerun_shards) )
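
The same check can be written as a small function that works on any listing text, which makes it testable without gsutil. This is a sketch under assumptions: `missing_shards` is a hypothetical helper, and the file names in the example listing are made up; only the ".gz-NNNNN-of-NNNNN" suffix is taken from the snippet above.

```python
def missing_shards(listing, num_shards=1000):
    """Return the sorted shard ids whose output file is absent from `listing`."""
    missing = []
    for i in range(num_shards):
        suffix = ".gz-%05d-of-%05d" % (i, num_shards)
        if suffix not in listing:
            missing.append(i)
    return missing


# Hypothetical listing with shard 1 of 3 missing.
example_listing = "\n".join([
    "gs://my-bucket/wiki_references/process_0/refs.gz-00000-of-00003",
    "gs://my-bucket/wiki_references/process_0/refs.gz-00002-of-00003",
])
print("INSTANCES_REDO =", missing_shards(example_listing, num_shards=3))
```

The returned list could be joined with commas and passed to --instance_ids for a relaunch.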

Text normalization: Alright, good to know, thanks!

code_dir upload: I fixed it by uploading the wikisum/-directory directly into the pip installation, so I did not have to upload the entire tensor2tensor-directory.

Time and costs: Good to know. So all in all, the costs were pretty ok, especially given the 250€ you get with a new google-cloud account.

Good luck on your experiments!

For others that see this bug, the fixes will be out with the next T2T release sometime next week.

Just started the first training run @rsepassi and noticed the following things

eval_steps

--eval_steps=100

Seems a bit low, especially given the huge amount of evaluation data. Looks like you would spend most time on evaluation with that setting.

hparams

There should be a note on setting the L parameter, as the paper mentions that the base transformer is not able to learn for L ~ 2000 (and also mentions the best range for L on lead section generation).
So perhaps a custom hparam set for wikisum would be a good idea.

@registry.register_hparams
def transformer_big_wsum_v1():
  hparams = transformer_big_single_gpu() # for best results
  hparams.batch_size = 2500 # maximum for 1080TI
  hparams.max_input_seq_length = 500  # setting L
  hparams.max_target_seq_length = 0 # never truncate EOS

  return hparams

(Is that the right parameter to set?)

Yes, I’ll add a note on the length parameter. Hopefully we’ll have better
repro instructions on the next push.

Deleted because I found something odd, now investigating

@rsepassi I found a problem that will prevent training on wikisum data at all.

debug_keep_up

If the flag is set, the script checks that num_instances is 1 instead of checking the actually requested instance_ids. However, using --instance_ids you could select one specific instance to debug and still have num_instances at 1000, so len(instance_ids) should be checked.

So https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wikisum/parallel_launch.py#L248 should be
assert len(instance_ids) == 1

Failing generation due to out-of-memory

About 25 instances fail during example generation due to running out of memory.

I did not find the bug previously, as the corresponding failure message only shows up in the log sometimes (mostly the process is killed silently).

But due to the next bug, I had to look into it and found that, in order to make sure all data files are generated, you need 3 GB of RAM instead of 2.

Silent failing if not all datafiles are there

_Deleted, assumption was wrong and it is working as expected_

Update:

Ok, it seems that - while the bug of not enough RAM exists - my conclusion of its impact was wrong. The reason for the incomplete evaluation was that I had still set "eval_drop_long_sequences=true".

The problem is, that might not help me determine why my first training run did overfit so badly. Will investigate further and report back.

Training behaviour

@rsepassi
Alright, overfitting is no longer happening and the training behaviour looks good to me with:

@registry.register_hparams
def transformer_base_wsum_v1():
  hparams = transformer_base_v1()

  hparams.max_input_seq_length = 500
  hparams.max_target_seq_length = 0
  hparams.max_length = 500

  return hparams

However, I am not sure if the metrics are within expected ranges, could you check against your own training runs? https://pasteboard.co/Hk2M9A9.png

Collapse of model output

And while the model still seems to learn nicely, I found something odd, and I am not sure whether it is expected during early training (20k steps and 1.5 days in) or a sign of a problem: the model output collapsed to one "best guess", and the model (4 beams, alpha 0.6) always outputs (more or less) the same sentence, regardless of the input.

Here you find an example decode from the dataset:
https://pastebin.com/a540WiZ2

Thanks for the further notes @f-lng.

I'll fix the debug_keep_up bug and up produce_examples to 3G of memory.

I'll also add a data validation script and instructions for running it to the README.

On the issue of training stopping when a file is missing, I don't understand where that behavior is coming from and so am skeptical that that's indeed what you're seeing. To find files to read, it uses a filepattern, so it shouldn't matter if there are missing ones. Could you provide some more of the observations that lead you to believe that it's stopping because of a missing file?

@rsepassi No you are right, it was just an assumption that I made based on the fact that evaluation stopped at 60/100 and the first missing file was in that range as well.
But the assumption was wrong, it did not stop because a file was missing. It simply skipped the data due to "eval_drop_long_sequences=true"

--
@rsepassi @lukaszkaiser

What is more of an issue now is the actual training.

Can one of you say anything about the expected training behaviour, my hyperparameters, and the model collapse that I am seeing? Is it expected after 2.5 days, or is something broken?

I am now 2.5 days in, and output seems to have collapsed to a single sentence, no matter what input the network gets.

Metrics are still looking OK I guess : https://pasteboard.co/Hkc3lcQ.png

For completeness, here are the used hparams again (learning the lead section problem)

@registry.register_hparams
def transformer_base_wsum_v1():
  hparams = transformer_base_v1()

  hparams.max_input_seq_length = 500
  hparams.max_target_seq_length = 0
  hparams.max_length = 500

  return hparams

Hi @f-lng @rsepassi,

I am trying to use the Transformer-Decoder-Only model from the "Generating Wikipedia..." paper for my own dataset.

I see that this thread points to instructions for generating the WikiSum dataset but I didn't see instructions for running the model as described in "Generating Wikipedia..." ?

Also, are there any guidelines on how I should format my dataset so that the T2T Transformer-Decoder-Only model can consume it?

Closing this bug as data generation now works. Thanks @f-lng for the bug finding!

For training, I believe these settings produce reasonable results. We usually trained on 8 GPUs or a TPU:

model = 'transformer'
hparams = "max_input_seq_length=$MAX_LEN"
hparams_set = 'transformer_base'
decode_hparams = 'beam_size=4,alpha=0.6,batch_size=1'
