A minor Java tokenization utf-related issue #83
Comments
@mallamanis I am not that familiar with the Java tokenizer. Is this something you understand better?

Hi both, I'll add this to my queue of things to check. As Alex mentions, it doesn't seem urgent.

One more example that may not be related, but if it is, it would make me suspect an off-by-one error rather than a UTF issue. Here are some Java functions which

Update: after parsing the whole Java dataset of 496k functions, only 978 cases (~0.2%) failed to parse (some of them due to a Java version mismatch, etc.).

Steps to reproduce:

    # Show full column contents (newer pandas versions expect None instead of -1)
    with pd.option_context('display.max_colwidth', -1):
        display(df[df.code.str.contains("getNamesForType")][["url", "code"]])

And again, tree-sitter seems to identify the location of the end of the block just fine.

Thanks again for the great work putting it all together and for the prompt reply. From my side, I'll also try to find some time to dig deeper into this.

This may not be very important or worth fixing immediately, but there may be a small bug in Java function tokenization.

At least one function in the dataset has code_tokens that do not include a { token.

Quick inspection with

shows tokens like tring, ymbol

for this code

code_tokens looks like this

I'm not very familiar with the extraction pipeline codebase, but the fact that tree-sitter seems to identify the locations well makes me think that JavaParser.get_definition(), which does some index math, may be worth closer inspection.
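
One way truncated tokens like these could arise, assuming (this is not confirmed from the pipeline code) that byte offsets into the UTF-8 source are used to slice a Python str: tree-sitter nodes report byte offsets, while str indexing counts characters, so every multi-byte character earlier in the file shifts the slice to the right. A minimal sketch, with hypothetical helper names and a made-up Java snippet:

```python
def slice_by_bytes(source: str, start_byte: int, end_byte: int) -> str:
    """Slice the UTF-8 encoding of the source, then decode the result."""
    return source.encode("utf-8")[start_byte:end_byte].decode("utf-8")


def slice_by_chars(source: str, start: int, end: int) -> str:
    """Slice the str directly, treating the offsets as character indices."""
    return source[start:end]


# Hypothetical Java snippet: the comment contains one two-byte character.
source = "// café\nString name;"

# Byte offsets of the identifier, as a byte-based parser would report them.
encoded = source.encode("utf-8")
start_byte = encoded.index(b"String")
end_byte = start_byte + len(b"String")

by_bytes = slice_by_bytes(source, start_byte, end_byte)
by_chars = slice_by_chars(source, start_byte, end_byte)

print(repr(by_bytes))  # 'String'
print(repr(by_chars))  # 'tring '
```

If this is what happens, the same shift would also move a function's end offset, which could explain a body whose tokens are cut off in unexpected places. Whether JavaParser.get_definition() actually mixes the two kinds of offsets would need to be checked in the code.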