Just a quick question:
Has anyone written a preprocessor for the Quora Question Pairs (QQP) dataset?
If not, I'll write one for the GLUE version of QQP.
Just in case anyone needs this: it is almost the same as the MRPC processor, except for the column indices.
class QuoraProcessor(DataProcessor):
  """Processor for Quora Question duplicate dataset (GLUE Version)"""

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    return ["0", "1"]

  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:
        continue
      if len(line) == 6:
        guid = "%s - %s" % (set_type, i)
        question_a = tokenization.convert_to_unicode(line[3])
        question_b = tokenization.convert_to_unicode(line[4])
        if set_type == "test":
          label = "0"
        else:
          label = tokenization.convert_to_unicode(line[5])
        examples.append(
            InputExample(guid=guid, text_a=question_a, text_b=question_b, label=label))
    return examples
Why are the test labels all set to 0? I see some 1s in the actual data.
@bamba518 I suspect the label isn't used at all for the test set. A lot of the test sets don't even carry the real label information.
@jageshmaharjan Thank you for this. Why do you add the line if len(line) == 6: though? I had to remove it to make it work. Again cheers!
Oops, I didn't update some changes here. I'll post an update soon.
Did you manage to reproduce the results for QQP? I'm having a lot of trouble with it. See #475.
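In the meantime, here is a rough sketch of one way to handle the length check, in case it helps. It assumes the GLUE train.tsv and dev.tsv rows have six columns (id, qid1, qid2, question1, question2, is_duplicate) while test.tsv rows have only three (id, question1, question2), which would explain why len(line) == 6 drops every test row. Please check the column layout against your own copy of the data. create_qqp_examples and the namedtuple InputExample are stand-ins written for this example so it runs on its own; they are not the actual run_classifier.py code.

```python
import collections

# Stand-in for InputExample from run_classifier.py (assumption: only these
# four fields are needed for this illustration).
InputExample = collections.namedtuple(
    "InputExample", ["guid", "text_a", "text_b", "label"])


def create_qqp_examples(lines, set_type):
  """Builds examples from QQP rows, skipping the header and malformed rows."""
  examples = []
  for i, line in enumerate(lines):
    if i == 0:
      continue  # Skip the header row.
    if set_type == "test":
      # Assumed test layout: id, question1, question2 (no label column).
      if len(line) != 3:
        continue
      text_a, text_b = line[1], line[2]
      label = "0"  # Placeholder only; the test rows carry no label here.
    else:
      # Assumed train/dev layout:
      # id, qid1, qid2, question1, question2, is_duplicate
      if len(line) != 6:
        continue
      text_a, text_b, label = line[3], line[4], line[5]
    examples.append(InputExample(
        guid="%s-%s" % (set_type, i),
        text_a=text_a, text_b=text_b, label=label))
  return examples


# Tiny made-up rows, just to exercise both branches.
train_rows = [
    ["id", "qid1", "qid2", "question1", "question2", "is_duplicate"],
    ["0", "1", "2", "How do I learn Python?",
     "What is the best way to learn Python?", "1"],
    ["1", "3", "4", "broken row"],  # Wrong column count, so it is skipped.
]
test_rows = [
    ["id", "question1", "question2"],
    ["0", "Is tea healthy?", "Is coffee healthy?"],
]

train_examples = create_qqp_examples(train_rows, "train")
test_examples = create_qqp_examples(test_rows, "test")
print(len(train_examples), len(test_examples))
```

Checking the column count per split keeps the protection against malformed rows that the original length check seems to be there for, without silently discarding the whole test set.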