Bert: Has anyone written a preprocessor for the "Quora Question duplicate" (QQP) dataset?

Created on 27 Nov 2018 · 6 Comments · Source: google-research/bert

Just a quick question:
Has anyone written a preprocessor for the "Quora Question duplicate" (QQP) dataset?
If not, I'll write one for the QQP dataset in GLUE.

Most helpful comment

Just in case anyone needs this: it is almost the same as the MRPC processor, except for the column indices.

class QuoraProcessor(DataProcessor):
    """Processor for the Quora Question duplicate data set (GLUE version)."""

    # Intended to be pasted into run_classifier.py, which already imports os
    # and tokenization and defines DataProcessor and InputExample.

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue  # Skip the header row.
            # Only rows with exactly six columns are used; any other row
            # is skipped silently.
            if len(line) == 6:
                guid = "%s-%s" % (set_type, i)
                question_a = tokenization.convert_to_unicode(line[3])
                question_b = tokenization.convert_to_unicode(line[4])
                if set_type == "test":
                    label = "0"  # Placeholder label for the test set.
                else:
                    label = tokenization.convert_to_unicode(line[5])
                examples.append(
                    InputExample(guid=guid, text_a=question_a, text_b=question_b, label=label))
        return examples

All 6 comments

mark

Why are the test labels all set to 0? I see some 1s in the actual data.

@bamba518 I suspect it's not used at all; a lot of the test sets don't even carry the real label information.

@jageshmaharjan Thank you for this. Why do you add the "if len(line) == 6:" check, though? I had to remove it to make it work. Again, cheers!

Oops, I didn't update this with some changes. I'll post an update soon.
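
For anyone hitting the same problem with the "if len(line) == 6:" check, here is a minimal sketch of one way to make the column handling explicit per split. It is not the author's update. It assumes the GLUE layout where train.tsv and dev.tsv have six columns (id, qid1, qid2, question1, question2, is_duplicate) and test.tsv has three (id, question1, question2); check the header row of your own files first. The function name create_qqp_examples and the namedtuple stand-in for InputExample are hypothetical, used only so the snippet runs without the BERT repo.

```python
import collections

# Stand-in for BERT's InputExample so this sketch runs on its own.
InputExample = collections.namedtuple(
    "InputExample", ["guid", "text_a", "text_b", "label"])

# Assumed column counts (verify against the header row of your files):
#   train.tsv / dev.tsv: id, qid1, qid2, question1, question2, is_duplicate
#   test.tsv:            id, question1, question2
TRAIN_DEV_COLUMNS = 6
TEST_COLUMNS = 3


def create_qqp_examples(lines, set_type):
    """Builds examples from parsed TSV rows, skipping the header and malformed rows."""
    examples = []
    for i, line in enumerate(lines):
        if i == 0:
            continue  # Skip the header row.
        if set_type == "test":
            if len(line) != TEST_COLUMNS:
                continue  # Skip rows with an unexpected column count.
            question_a, question_b = line[1], line[2]
            label = "0"  # Placeholder; the assumed test layout has no label column.
        else:
            if len(line) != TRAIN_DEV_COLUMNS:
                continue  # Skip rows with an unexpected column count.
            question_a, question_b, label = line[3], line[4], line[5]
        examples.append(InputExample(
            guid="%s-%s" % (set_type, i),
            text_a=question_a, text_b=question_b, label=label))
    return examples


if __name__ == "__main__":
    train_rows = [
        ["id", "qid1", "qid2", "question1", "question2", "is_duplicate"],
        ["0", "1", "2", "How do I learn Python?", "How can I learn Python?", "1"],
        ["1", "3", "4", "row with missing columns"],
    ]
    test_rows = [
        ["id", "question1", "question2"],
        ["0", "What is BERT?", "How does BERT work?"],
    ]
    for example in create_qqp_examples(train_rows, "train"):
        print(example)
    for example in create_qqp_examples(test_rows, "test"):
        print(example)
```

If your test.tsv does have six columns like the training file, the original check works as posted and this split-specific handling is not needed.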

Did you manage to reproduce the results for QQP? I'm having a lot of trouble with it. See #475.
