Observed following exception in Collector logs using Elasticsearch as backend when span names are really long:
{"level":"error","ts":1524250693.569798,"caller":"config/config.go:90","msg":"Elasticsearch could not process bulk request","error":"elastic: Error 400 (Bad Request): Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 541; [type=action_request_validation_exception]","response":null,"requests":"{\"index\":{\"_id\":\"jaeger-test|start-515-vdadclohlswrzxxbdmsahpfhxcqjleonjkymoiqaixglisojzkkiajlrhdgmtwxmttoezjjkypzpisvzkeuysgurctzbzmhhbcgnnbgzswnwbwlazjznggagnktuvdbsskolpdmmfygajbuquqadvhipybykvwifkabdjgpxtjoggbeggprfzvaahmapprdpeveynztdqtjbrbsdpptrecqhsisnznlaocytmfcxvuteguihobetxaxrgxlkqjydrxrywsholbenemazkexfhlxzrdtogzvvgtgvtloclkalhkordtoxbynqcdtxfinofpmndblncheupynxjwgplmdddoppsjlltxkvtbbjszaogvhrhrehkdebhbrucrfwkhuglifmfjvrpyrysptsaahocxfkfalbdgdyldyeshluxtoiayrvopbuzgguhqczibibuhxmluxhygsztfqpqffcgdkaidkbiriawdfgpjlopbemaksvlhpdnynstlkqmyn-end\",\"_index\":\"jaeger-service-2018-04-20\",\"_type\":\"service\"}}
{\"serviceName\":\"jaeger-test\",\"operationName\":\"start-515-vdadclohlswrzxxbdmsahpfhxcqjleonjkymoiqaixglisojzkkiajlrhdgmtwxmttoezjjkypzpisvzkeuysgurctzbzmhhbcgnnbgzswnwbwlazjznggagnktuvdbsskolpdmmfygajbuquqadvhipybykvwifkabdjgpxtjoggbeggprfzvaahmapprdpeveynztdqtjbrbsdpptrecqhsisnznlaocytmfcxvuteguihobetxaxrgxlkqjydrxrywsholbenemazkexfhlxzrdtogzvvgtgvtloclkalhkordtoxbynqcdtxfinofpmndblncheupynxjwgplmdddoppsjlltxkvtbbjszaogvhrhrehkdebhbrucrfwkhuglifmfjvrpyrysptsaahocxfkfalbdgdyldyeshluxtoiayrvopbuzgguhqczibibuhxmluxhygsztfqpqffcgdkaidkbiriawdfgpjlopbemaksvlhpdnynstlkqmyn-end\"}
{\"index\":{\"_id\":\"jaeger-test|ChildSpan\",\"_index\":\"jaeger-service-2018-04-20\",\"_type\":\"service\"}}
{\"serviceName\":\"jaeger-test\",\"operationName\":\"ChildSpan\"}
{\"index\":{\"_index\":\"jaeger-span-2018-04-20\",\"_type\":\"span\"}}
{\"traceID\":\"ce09cc5e105685ee\",\"spanID\":\"ce09cc5e105685ee\",\"parentSpanID\":\"0\",\"operationName\":\"start-515-vdadclohlswrzxxbdmsahpfhxcqjleonjkymoiqaixglisojzkkiajlrhdgmtwxmttoezjjkypzpisvzkeuysgurctzbzmhhbcgnnbgzswnwbwlazjznggagnktuvdbsskolpdmmfygajbuquqadvhipybykvwifkabdjgpxtjoggbeggprfzvaahmapprdpeveynztdqtjbrbsdpptrecqhsisnznlaocytmfcxvuteguihobetxaxrgxlkqjydrxrywsholbenemazkexfhlxzrdtogzvvgtgvtloclkalhkordtoxbynqcdtxfinofpmndblncheupynxjwgplmdddoppsjlltxkvtbbjszaogvhrhrehkdebhbrucrfwkhuglifmfjvrpyrysptsaahocxfkfalbdgdyldyeshluxtoiayrvopbuzgguhqczibibuhxmluxhygsztfqpqffcgdkaidkbiriawdfgpjlopbemaksvlhpdnynstlkqmyn-end\",\"references\":[],\"startTime\":1524250565976807,\"duration\":394,\"tags\":[{\"key\":\"sampler.type\",\"type\":\"string\",\"value\":\"const\"},{\"key\":\"sampler.param\",\"type\":\"string\",\"value\":\"True\"},{\"key\":\"component\",\"type\":\"string\",\"value\":\"jaeger-test\"}],\"logs\":[{\"timestamp\":1524250565976857,\"fields\":[{\"key\":\"event\",\"type\":\"string\",\"value\":\"{\\\"life\\\": 42, \\\"event\\\": \\\"test message\\\"}\"}]},{\"timestamp\":1524250565976933,\"fields\":[{\"key\":\"event\",\"type\":\"string\",\"value\":\"down below\"}]}],\"processID\":\"\",\"process\":{\"serviceName\":\"jaeger-test\",\"tags\":[{\"key\":\"hostname\",\"type\":\"string\",\"value\":\"M266605UXG8WP\"},{\"key\":\"ip\",\"type\":\"int64\",\"value\":\"2130706433\"},{\"key\":\"jaeger.version\",\"type\":\"string\",\"value\":\"Python-3.7.1\"}]},\"warnings\":null,\"startTimeMillis\":1524250565976}
{\"index\":{\"_index\":\"jaeger-span-2018-04-20\",\"_type\":\"span\"}}
{\"traceID\":\"ce09cc5e105685ee\",\"spanID\":\"dcaadcd0affdb718\",\"parentSpanID\":\"ce09cc5e105685ee\",\"operationName\":\"ChildSpan\",\"references\":[],\"startTime\":1524250565976928,\"duration\":11,\"tags\":[{\"key\":\"component\",\"type\":\"string\",\"value\":\"jaeger-test\"}],\"logs\":[],\"processID\":\"\",\"process\":{\"serviceName\":\"jaeger-test\",\"tags\":[{\"key\":\"ip\",\"type\":\"int64\",\"value\":\"2130706433\"}]},\"warnings\":null,\"startTimeMillis\":1524250565976}
{\"index\":{\"_index\":\"jaeger-span-2018-04-20\",\"_type\":\"span\"}}
{\"traceID\":\"3c1b203ab513df76\",\"spanID\":\"4df8725a28167f40\",\"parentSpanID\":\"3c1b203ab513df76\",\"operationName\":\"ChildSpan\",\"references\":[],\"startTime\":1524250638506012,\"duration\":10,\"tags\":[{\"key\":\"component\",\"type\":\"string\",\"value\":\"jaeger-test\"}],\"logs\":[],\"processID\":\"\",\"process\":{\"serviceName\":\"jaeger-test\",\"tags\":[{\"key\":\"ip\",\"type\":\"int64\",\"value\":\"2130706433\"}]},\"warnings\":null,\"startTimeMillis\":1524250638506}
{\"index\":{\"_id\":\"jaeger-test|start-12-fqvknsjqeijw-end\",\"_index\":\"jaeger-service-2018-04-20\",\"_type\":\"service\"}}
{\"serviceName\":\"jaeger-test\",\"operationName\":\"start-12-fqvknsjqeijw-end\"}
{\"index\":{\"_index\":\"jaeger-span-2018-04-20\",\"_type\":\"span\"}}
{\"traceID\":\"3c1b203ab513df76\",\"spanID\":\"3c1b203ab513df76\",\"parentSpanID\":\"0\",\"operationName\":\"start-12-fqvknsjqeijw-end\",\"references\":[],\"startTime\":1524250638505913,\"duration\":261,\"tags\":[{\"key\":\"sampler.type\",\"type\":\"string\",\"value\":\"const\"},{\"key\":\"sampler.param\",\"type\":\"string\",\"value\":\"True\"},{\"key\":\"component\",\"type\":\"string\",\"value\":\"jaeger-test\"}],\"logs\":[{\"timestamp\":1524250638505959,\"fields\":[{\"key\":\"event\",\"type\":\"string\",\"value\":\"{\\\"life\\\": 42, \\\"event\\\": \\\"test message\\\"}\"}]},{\"timestamp\":1524250638506017,\"fields\":[{\"key\":\"event\",\"type\":\"string\",\"value\":\"down below\"}]}],\"processID\":\"\",\"process\":{\"serviceName\":\"jaeger-test\",\"tags\":[{\"key\":\"hostname\",\"type\":\"string\",\"value\":\"M266605UXG8WP\"},{\"key\":\"ip\",\"type\":\"int64\",\"value\":\"2130706433\"},{\"key\":\"jaeger.version\",\"type\":\"string\",\"value\":\"Python-3.7.1\"}]},\"warnings\":null,\"startTimeMillis\":1524250638505}","stacktrace":"github.com/jaegertracing/jaeger/pkg/es/config.(*Configuration).NewClient.func2\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/es/config/config.go:90\ngithub.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic%2ev5.(*bulkWorker).commit\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic.v5/bulk_processor.go:506\ngithub.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic%2ev5.(*bulkWorker).work\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic.v5/bulk_processor.go:455"}
Apparently, span name is also used as "_id" of document when storing span information in Elasticsearch and for spans with long names ( longer than 512 bytes), ES bulk operation fails with above exception.
Is there a limit on max length for span names?
Sample script to reproduce the issue:
#!/usr/bin/env python
import logging
import time
import string
import random
from jaeger_client import Config
def main():
log_level = logging.DEBUG
logging.getLogger('').handlers = []
logging.basicConfig(format='%(asctime)s %(message)s', level=log_level)
config = Config(
config={ # usually read from some yaml config
'sampler': {
'type': 'const',
'param': 1,
},
'logging': True,
},
service_name='jaeger-test'
)
# this call also sets opentracing.tracer
tracer = config.initialize_tracer()
max_length = 515
span_name = "start-{}-{}-end".format(max_length, ''.join(random.choice(string.lowercase) for x in range(max_length)))
with tracer.start_span(span_name) as span:
span.log_kv({'event': 'test message', 'life': 42})
with tracer.start_span('ChildSpan', child_of=span) as child_span:
span.log_kv({'event': 'down below'})
time.sleep(2) # yield to IOLoop to flush the spans - https://github.com/jaegertracing/jaeger-client-python/issues/50
tracer.close() # flush any buffered spans
if __name__ == "__main__":
main()
Also, it looks like once collector gets one such span, it gets into a loop trying to insert this trace/span information and keeps failing and doesn't store any new traces/spans
Apparently, span name is also used as "_id" of document when storing span information
@black-adder is this for storing the actual span or the additional indices? I don't see why span name needs to be in the _id, we never query by that ID and can simply let ES assign a random ID to the documents.
I had this problem too.
If I understand correctly this index name is created here:
https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/spanstore/service_operation.go#L77
Q1: does it have to be unique?
Q2: when do we query ES using that index name?
For every service/operation that we see, we create an index entry for it so that users can quickly look up service/operation. We manually generate the _id so that if we see the same service/operation pair, it's only stored once. If we remove this, we'll be writing the same info into the index which will bloat the index unnecessarily. We can remove the automatic generation of the id but I prefer it to be opt out instead.
q1) doesn't have to be but it should be more performant
q2) when we want to look up existing service/operation pairs
@black-adder AFAIU cache key doesn't have to be the same as document _id, at least by looking at
https://github.com/tmszdmsk/jaeger/blob/6f43001334a989833a289c5b2a62cc2428f9d564/plugin/storage/es/spanstore/service_operation.go#L79-L81
I couldn't find any other operation that modifies or reads serviceCache. Where is it accessed then?
The code you quoted both reads and modifies the cache.
https://github.com/jaegertracing/jaeger/blob/6f43001334a989833a289c5b2a62cc2428f9d564/plugin/storage/es/spanstore/service_operation.go#L79-L82
We don't care which document ID is used in ES as long as it is stable, to avoid duplicate entries. If simple concatenation is causing issues for ES, we can apply a hashing function instead.
@yurishkuro that's why I've linked it ;). Using this cache doesn't require setting client-side _id in ES.
that in memory cache is used so that unique service/operation indexes are only written at most once an hour (just an optimization). We would still need _id in ES to reduce duplicate index writes.
Ah, ok, I totally get it now, it does document deduplication on an ES level. I agree with @yurishkuro then, we can hash the cacheKey and put it as _id.
I think #905 solves this issue, so closing. Having said that, the original premise of this ticket that a combination of serviceName+operationName doesn't fit in the ES _id length sounds fundamentally wrong to me. E.g. the operation name was "start-515-vdadclohlswrzxxbdmsahpfhxcqjleonjkymoiqaixglisojzkkiajlrhdgmtwxmttoezjjkypzpisvzkeuysgurctzbzmhhbcgnnbgzswnwbwlazjznggagnktuvdbsskolpdmmfygajbuquqadvhipybykvwifkabdjgpxtjoggbeggprfzvaahmapprdpeveynztdqtjbrbsdpptrecqhsisnznlaocytmfcxvuteguihobetxaxrgxlkqjydrxrywsholbenemazkexfhlxzrdtogzvvgtgvtloclkalhkordtoxbynqcdtxfinofpmndblncheupynxjwgplmdddoppsjlltxkvtbbjszaogvhrhrehkdebhbrucrfwkhuglifmfjvrpyrysptsaahocxfkfalbdgdyldyeshluxtoiayrvopbuzgguhqczibibuhxmluxhygsztfqpqffcgdkaidkbiriawdfgpjlopbemaksvlhpdnynstlkqmyn-end", which looks like garbage to me, indicating an issue with instrumentation.
@yurishkuro yes #905 will solve this. Re. your comment regarding issue with instrumentation that is not true, there was nothing wrong with our instrumentation, we had legitimate long operation names which were causing this issue, "garbage" you mentioned is just a dummy operation name I created to illustrate the issue without revealing our internal operation names to prevent information disclosure.
@bharat-p fair enough. The new release 1.6.0 is building now.