-
Notifications
You must be signed in to change notification settings - Fork 1.9k
C++: Represent temporary object initialization in AST and IR #4432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change adds a new module, `PrintIRLocalFlow.qll`, which can be imported into any query that uses both `PrintIR.qll` and the IR dataflow library. The IR dump printed by `PrintIR.qll` will be annotated with information about how each operand and instruction participates in dataflow.
For each operand and instruction, the following propeties are displayed:
- `flow`: Which local operands/instructions have flow to this node, and which local operands/instruction this node has flow to.
- `source`: `true` if this node is a source
- `sink`: `true` if this node is a sink
- `barrier`: Lists which kinds of barrier this node is. Can be zero or more of `full`, `in`, `out`, and `guard`. If the node is a guard barrier, the IR of the guarding instruction is also printed.
We already had a way to print additional properties for instructions and blocks, but not for operands. I added support for operand properties to `IRPropertyProvider`. These are now printed in a curly-brace-enclosed list immediately after the corresponding operand.
When printing flow, instructions are identified by their result ID (e.g., `m128`). Operands are identified by both the result ID of their instruction and their kind (e.g., `r145.left`). For flow from an operand to its use instruction, it just prints `result` at the operand, and prints only the operand kind on the instruction.
Example output:
```
# 344| m344_34(vector<int, allocator<int>>) = Chi : total:m344_20{flow:def->@, @->result}, partial:m344_33{flow:def->@, @->result}
# 344| flow = total->@, partial->@, +m344_33->@, @->+r347_3, @->v347_7.side_effect, @->m347_9.total, @->m344_20.1
```
The `+` annotations indicate when the flow came from `isAdditionalFlowStep()`, rather than built-in local flow.
PrintAST now works on `library-tests/dataflow/taint-tests`.
| cached | ||
| private predicate instructionTaintStep(Instruction i1, Instruction i2) { | ||
| private predicate commonTaintStep(DataFlow::Node fromNode, DataFlow::Node toNode) { | ||
| instructionToInstructionTaintStep(fromNode.asInstruction(), toNode.asInstruction()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with how we interleave instruction -> operand steps in dataflow we should probably convert this to pure instruction -> operand and operand -> instruction flow at some point once this PR has been merged. But I'm fine with these changes as they are now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I just didn't want to change more than I already had to.
|
Very few diffs. The |
|
JDK diff looks like a true positive. A call to a pure constructor wasn't treated as being a |
|
php-src diff did not repo locally. |
|
Diffs look good. Perf is nearly unchanged. This is ready for final sign-off. |
geoffw0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just reviewed the code changes as best I can and they look good to me. I remember checking the taint test result changes from this PR quite thoroughly before and was very happy with them as well.
A comment by @jbj appears to have gone missing from the front page of the PR. Apart from that, it appears all concerns have been addressed.
I'm keen to merge this today. We've certainly all had plenty of opportunity to review at this point. I've started SAMATE runs (https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp/548/ and https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp2/31/) on this branch, mostly for my own information - they will take ~18 hours to complete so lets not wait for them.
|
Please don't merge this PR until the |
I foolishly assumed that had happened yesterday as planned. Is there somewhere I can check for sure, or can you post here when it's done? |
It's done, and we can move forward. But this branch can only be merged together with its internal PR, and the internal PR has both a test failure and a submodule conflict. |
The fixups that I'm doing involve introducing new temporary objects where there were none before, so I'm not sure how I'd do that in the AST layer without drastically changing how we implement the AST. I've considered fixing this all up in the extractor, but if it's primarily for the IR's benefit, I think I'd prefer to add more QL to the IR construction than more C++ to the extractor. |
A performance regression in `definitionByReferenceNodeFromArgument#ff` was ultimately caused by a join on parameter indexes in `DefinitionByReferenceNode.getArgument`. Joining on numbers in QL is always fragile, and somehow the changes in github#4432 had caused the join order here to break. Instead of tweaking the join order in the slow predicate itself, I added `pragma[noinline]` to one of the predicates involved in the join on parameter indexes. This should prevent us from getting similar performance problems in the future when we write code that joins on parameter numbers. Joining on indexes is always risky, but it's even more risky when one of the predicates in the join is inlined by the compiler and expands to further joins. I tested performance by running `CgiXss.ql` on a ChakraCore snapshot. Tuple counts before (I interrupted execution after five minutes or so): (626s) Tuple counts for DataFlowUtil::definitionByReferenceNodeFromArgument#ff: 58162 ~0% {3} r1 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, -1, I.<0> 26934 ~0% {2} r2 = JOIN r1 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 2 OUTPUT r1.<0>, r1.<2> 26934 ~1% {2} r3 = JOIN r2 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r2.<1> 26850 ~1% {2} r4 = JOIN r3 WITH Instruction::CallInstruction::getThisArgumentOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<1> 26850 ~0% {2} r5 = JOIN r4 WITH Operand::Operand::getDef_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r4.<1> 26850 ~1% {2} r6 = JOIN r5 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r5.<1> 58162 ~0% {2} r7 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, I.<0> 58162 ~4% {3} r8 = JOIN r7 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 1 OUTPUT R.<1>, r7.<1>, r7.<0> 4026581120 ~0% {4} r9 = JOIN r8 WITH Instruction::CallInstruction::getPositionalArgumentOperand_dispred#fff_102#join_rhs AS R ON FIRST 1 OUTPUT r8.<2>, R.<1>, r8.<1>, R.<2> 31154 ~4% {2} r10 = JOIN r9 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 2 OUTPUT r9.<3>, r9.<2> 31154 ~8% {2} r11 = JOIN r10 WITH Operand::Operand::getDef_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r10.<1> 31154 ~0% {2} r12 = JOIN r11 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r11.<1> 58004 ~0% {2} r13 = r6 \/ r12 return r13 Tuple counts after: (0s) Tuple counts for DataFlowUtil::definitionByReferenceNodeFromArgument#ff: 385785 ~6% {2} r1 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, I.<0> 385785 ~0% {3} r2 = JOIN r1 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 1 OUTPUT r1.<0>, r1.<1>, R.<1> 385785 ~1% {3} r3 = JOIN r2 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r2.<2>, r2.<1> 198736 ~4% {2} r4 = JOIN r3 WITH Instruction::CallInstruction::getPositionalArgument#fff AS R ON FIRST 2 OUTPUT R.<2>, r3.<2> 198736 ~0% {2} r5 = JOIN r4 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r4.<1> 385785 ~1% {3} r6 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, -1, I.<0> 186891 ~1% {2} r7 = JOIN r6 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 2 OUTPUT r6.<0>, r6.<2> 186891 ~2% {2} r8 = JOIN r7 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r7.<1> 183201 ~3% {2} r9 = JOIN r8 WITH Instruction::CallInstruction::getThisArgumentOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r8.<1> 183201 ~0% {2} r10 = JOIN r9 WITH Operand::Operand::getDef_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r9.<1> 175449 ~8% {2} r11 = JOIN r10 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r10.<1> 374185 ~3% {2} r12 = r5 \/ r11 return r12


Fixes github/codeql-c-analysis-team#157, which is the first part of github/codeql-c-analysis-team#154.
This is the
github/codeqlside of the work. The dbscheme is updated to add a new expression kind,@temp_init, which is treated as a conversion and thus not in the parent-child tree in the AST. These expressions are represented by the newTemporaryObjectExpr(naming bikeshed still open), which has the initializer as itsgetExpr(). Very few queries should notice the change, since most don't look at conversions anyway, and those that do generally care specifically about numeric conversions anyway.The IR changes to support these expressions went pretty smoothly. The code is very similar to how we handle
ThrowExpr, where we also have an initializer that initializes a temporary object.This fixes all remaining consistency failures in the IR test directory, as well as a bunch of similar consistency failures in the
syntax-zoo.Most of the test diffs are just in tests that dump lists of expected expressions, and now include the occasional
TemporaryObjectExpr. However, the diffs inlibrary-tests/dataflow/taint-testsbear close examination (@geoffw0?):std::make_pair("foo", "blah")return astd::pair<const char*, const char*>, rather thanstd::pair<const char[4], const char[5]>(the argument types are supposed to undergo "decay").isParamDeref()instead ofisParam()as appropriate.I also made some changes to how local taint flow is computed in
DefaultTaintTracking.qll. The main change is that model flow now goes fromOperand->Instruction, rather thanInstruction->Instruction. This fixes a couple dozen cases of missed taint flow inlibrary-tests/dataflow/DefaultTaintTracking.To do: