The Wayback Machine - https://web.archive.org/web/20201210063351/https://github.com/github/CodeSearchNet/issues/83
A minor Java tokenization utf-related issue #83

Closed
bzz opened this issue Oct 27, 2019 · 3 comments


@bzz bzz commented Oct 27, 2019

This may not be something very important or worth fixing immediately, but there may be a small bug in Java function tokenization.

At least one function in the dataset has code_tokens that do not include a { token.


Quick inspection with

url = 'https://github.com/jbehave/jbehave-core/blob/bdd6a6199528df3c35087e72d4644870655b23e6/examples/i18n/src/main/java/org/jbehave/examples/trader/i18n/steps/DeSteps.java#L22-L25'
# max_colwidth=None disables truncation (-1 is deprecated since pandas 1.0)
with pd.option_context('display.max_colwidth', None):
    display(jdf.loc[jdf['url'] == url, ['code', 'code_tokens']])

shows truncated tokens such as tring and ymbol (instead of String and symbol) for this code:

@Given("ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold")
public void aStock(@Named("sümbol") String symbol, @Named("threshold") double threshold) { ...

The resulting code_tokens look like this:

[@, Given, (, "ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold"), , public, void, aStock, (, @, Named, (, "sümbol"), , tring , ymbol,, , N, amed(, ", threshold"), ...]

I'm not very familiar with the extraction pipeline codebase, but tree-sitter seems to identify the locations correctly (screenshot: Screen Shot 2019-10-27 at 2 52 10 PM). That makes me think that JavaParser.get_definition(), which does some index math, may be worth a closer inspection.
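
One hypothesis that would be consistent with the shifted tokens above (it is not confirmed anywhere in this thread): tree-sitter reports node positions as byte offsets into the UTF-8 source, and if such offsets are used to slice a Python str, which is indexed by character, every token after a multi-byte character such as ü shifts by one. The sketch below reproduces that effect without tree-sitter. The helper names are hypothetical, and it does not reflect the actual JavaParser.get_definition() code.

```python
# Shortened version of the signature from the issue; "ü" takes 2 bytes in UTF-8.
code = 'f(@Named("sümbol") String symbol)'
data = code.encode("utf-8")


def byte_span(token):
    """Return (start, end) byte offsets of the first occurrence of token.

    Stands in for the byte offsets a parser would report (assumption).
    """
    raw = token.encode("utf-8")
    start = data.find(raw)
    return start, start + len(raw)


def slice_as_chars(span):
    """Buggy variant: applies byte offsets to a character-indexed str."""
    start, end = span
    return code[start:end]


def slice_as_bytes(span):
    """Correct variant: slices the encoded bytes, then decodes."""
    start, end = span
    return data[start:end].decode("utf-8")


span = byte_span("String")
wrong = slice_as_chars(span)   # shifted right by one character
right = slice_as_bytes(span)   # the intended token
print(repr(wrong), repr(right))
```

If this is indeed the cause, slicing the encoded bytes instead of the str would be one way to avoid the shift.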


@hamelsmu hamelsmu commented Oct 29, 2019

@mallamanis I am not that familiar with the Java tokenizer. Is this something you understand better?


@mallamanis mallamanis commented Oct 30, 2019

Hi both, I'll add this to my queue of things to check. As Alex mentions, it doesn't seem urgent.

@mallamanis mallamanis self-assigned this Oct 30, 2019

@bzz bzz commented Oct 30, 2019

Here is one more example that may not be related. If it is, it would point to some kind of off-by-one error rather than a UTF 🐞. I'm happy to move it to a separate issue if preferred.

Here are some Java functions whose code, url and code_tokens are missing a number of lines of code at the end, which makes them hard to parse.

Update: after parsing the whole Java dataset of 496k functions, only 978 cases (~0.2%) failed to parse (some of them due to a Java version mismatch, etc.).

Steps to reproduce
# max_colwidth=None disables truncation (-1 is deprecated since pandas 1.0)
with pd.option_context('display.max_colwidth', None):
    display(df[df.code.str.contains("getNamesForType")][["url", "code"]])

And again, tree-sitter seems to identify the location of the end of the block just fine (screenshot: Screen Shot 2019-10-30 at 6 58 32 PM).
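
Functions that lose lines at the end, or that lack a { token altogether, can also be flagged cheaply without a full Java parser. The sketch below is only a heuristic under stated assumptions: it takes a code_tokens-style list of strings, assumes every function in the dataset should have a body, and uses hypothetical helper names. It is not the parsing approach used for the numbers above.

```python
def brace_balance(tokens):
    """Return the count of '{' minus the count of '}' in a token list."""
    depth = 0
    for tok in tokens:
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
    return depth


def looks_truncated(tokens):
    """Heuristic: no opening brace at all, or braces that do not balance.

    Assumes each function has a body; bodiless declarations such as
    interface methods would be flagged as false positives.
    """
    return "{" not in tokens or brace_balance(tokens) != 0


complete = ["void", "f", "(", ")", "{", "return", ";", "}"]
cut_off = ["void", "f", "(", ")", "{", "return", ";"]
no_body = ["void", "f", "(", ")"]

for name, toks in [("complete", complete), ("cut_off", cut_off), ("no_body", no_body)]:
    print(name, looks_truncated(toks))
```

On a DataFrame like the one above, something along the lines of df[df.code_tokens.apply(looks_truncated)] would list the candidates for manual review.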

Thanks again for the great work putting it all together and for the prompt reply. From my side, I'll also try to find some time to dig deeper into this.

@hamelsmu hamelsmu closed this Sep 4, 2020