A minor Java tokenization utf-related issue #83

bzz · 2019-10-27T14:02:59Z

This may not be something very important or worth fixing immediately, but there may be a small bug in Java function tokenization.

At least one function in the dataset has code_tokens that do not include a { token.

Quick inspection with

with pd.option_context('display.max_colwidth', -1):
    display(jdf.loc[jdf['url'] == 'https://github.com/jbehave/jbehave-core/blob/bdd6a6199528df3c35087e72d4644870655b23e6/examples/i18n/src/main/java/org/jbehave/examples/trader/i18n/steps/DeSteps.java#L22-L25'][['code', 'code_tokens']])

shows tokens like tring , ymbol for this code

@Given("ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold")
public void aStock(@Named("sümbol") String symbol, @Named("threshold") double threshold) { ...

code_tokens looks like this

[@, Given, (, "ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold"), , public, void, aStock, (, @, Named, (, "sümbol"), , tring , ymbol,, , N, amed(, ", threshold"), ...]

I'm not very familiar with the extraction pipeline codebase, but tree-sitter seems to identify the locations correctly, which makes me think that JavaParser.get_definition(), which does some index math, may be worth a closer inspection.
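
One pattern that would produce exactly these tokens is using byte offsets as character indices: tree-sitter reports positions in bytes, and ü takes two bytes in UTF-8, so every token after it on the same line would be sliced one character too late. The sketch below only illustrates that effect on part of the line from the example. The variable names are made up here, and whether JavaParser.get_definition() actually does this has not been confirmed.

```python
# Illustration only: names are hypothetical, not taken from the extraction pipeline.
line = '@Named("s\u00fcmbol") String symbol'  # \u00fc is the precomposed "ü"
data = line.encode("utf-8")

# Byte offsets of the `String` token, as a byte-oriented parser would report them.
start = data.index(b"String")
end = start + len(b"String")

# Assumed bug: byte offsets applied to a str, which is indexed by code points.
wrong = line[start:end]

# Slicing the encoded bytes and decoding afterwards keeps the token intact.
right = data[start:end].decode("utf-8")

print(repr(wrong), repr(right))
```

The first slice comes out as "tring " (with the trailing space), which matches the token seen in code_tokens, while the second yields "String".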

hamelsmu · 2019-10-29T00:25:34Z

@mallamanis I am not that familiar with the Java tokenizer. Is this something you understand better?

mallamanis · 2019-10-30T16:27:51Z

Hi both, I'll add this to my queue of things to check. As Alex mentions, it doesn't seem urgent.

bzz · 2019-10-30T18:23:34Z

One more example that may not be related, but if it is, it would point to an off-by-one error rather than a UTF 🐞. I will be happy to move it to a separate issue as well.

Here are some Java functions whose code, url and code_tokens are missing a number of lines at the end, which makes them hard to parse.

Update: after parsing the whole Java dataset of 496k functions, only 978 cases (~0.2%) failed to parse (some of them due to a Java version mismatch, etc.).

Steps to reproduce

with pd.option_context('display.max_colwidth', -1):
    display(df[df.code.str.contains("getNamesForType")][["url", "code"]])

And again, tree-sitter seems to identify the location of the end of the block just fine.
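
To look for more functions of this kind without running a full Java parser, a rough filter is to compare the number of opening and closing braces in the code field. This is a minimal sketch on made-up records. It does not skip braces inside string literals or comments, so it can misreport, and it is not the check that produced the 978 figure above.

```python
def unclosed_braces(code: str) -> int:
    """Naive count of '{' minus '}'; braces in strings or comments are not skipped."""
    return code.count("{") - code.count("}")


# Hypothetical records standing in for dataset rows with `url` and `code` fields.
records = [
    {"url": "example-1", "code": "void a() {\n    run();\n}"},
    {"url": "example-2", "code": "void b() {\n    if (x) {\n        run();"},
]

# A positive balance suggests the function body was cut off before its end.
truncated = [r["url"] for r in records if unclosed_braces(r["code"]) > 0]
print(truncated)
```

Only the second record is flagged, since its body ends before either block is closed.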

Thanks again for the great work putting it all together and for the prompt reply. From my side, I'll also try to find some time to dig deeper into this.

mallamanis self-assigned this Oct 30, 2019

hamelsmu closed this Sep 4, 2020
