The Wayback Machine - https://web.archive.org/web/20251226211144/https://github.com/github/codeql/pull/4432
Skip to content

Conversation

@dbartol
Copy link

@dbartol dbartol commented Oct 7, 2020

Fixes github/codeql-c-analysis-team#157, which is the first part of github/codeql-c-analysis-team#154.

This is the github/codeql side of the work. The dbscheme is updated to add a new expression kind, @temp_init, which is treated as a conversion and thus not in the parent-child tree in the AST. These expressions are represented by the new TemporaryObjectExpr (naming bikeshed still open), which has the initializer as its getExpr(). Very few queries should notice the change, since most don't look at conversions anyway, and those that do generally care specifically about numeric conversions anyway.

The IR changes to support these expressions went pretty smoothly. The code is very similar to how we handle ThrowExpr, where we also have an initializer that initializes a temporary object.

This fixes all remaining consistency failures in the IR test directory, as well as a bunch of similar consistency failures in the syntax-zoo.

Most of the test diffs are just in tests that dump lists of expected expressions, and now include the occasional TemporaryObjectExpr. However, the diffs in library-tests/dataflow/taint-tests bear close examination (@geoffw0?):

  • This change fixes dozens of false negatives.
  • Fixes several ODR violations that were making it difficult to dump ASTs or IR.
  • Subtle tweaks to our mock STL implementation to make the behavior consistent with the standard. The biggest of these is to make std::make_pair("foo", "blah") return a std::pair<const char*, const char*>, rather than std::pair<const char[4], const char[5]> (the argument types are supposed to undergo "decay").
  • Several models have been updated to use isParamDeref() instead of isParam() as appropriate.

I also made some changes to how local taint flow is computed in DefaultTaintTracking.qll. The main change is that model flow now goes from Operand->Instruction, rather than Instruction->Instruction. This fixes a couple dozen cases of missed taint flow in library-tests/dataflow/DefaultTaintTracking.

To do:

@dbartol dbartol added C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR labels Oct 7, 2020
Dave Bartolomeo added 21 commits October 7, 2020 13:14
This change adds a new module, `PrintIRLocalFlow.qll`, which can be imported into any query that uses both `PrintIR.qll` and the IR dataflow library. The IR dump printed by `PrintIR.qll` will be annotated with information about how each operand and instruction participates in dataflow.

For each operand and instruction, the following propeties are displayed:
- `flow`: Which local operands/instructions have flow to this node, and which local operands/instruction this node has flow to.
- `source`: `true` if this node is a source
- `sink`: `true` if this node is a sink
- `barrier`: Lists which kinds of barrier this node is. Can be zero or more of `full`, `in`, `out`, and `guard`. If the node is a guard barrier, the IR of the guarding instruction is also printed.

We already had a way to print additional properties for instructions and blocks, but not for operands. I added support for operand properties to `IRPropertyProvider`. These are now printed in a curly-brace-enclosed list immediately after the corresponding operand.

When printing flow, instructions are identified by their result ID (e.g., `m128`). Operands are identified by both the result ID of their instruction and their kind (e.g., `r145.left`). For flow from an operand to its use instruction, it just prints `result` at the operand, and prints only the operand kind on the instruction.

Example output:
```
#  344|     m344_34(vector<int, allocator<int>>)                                               = Chi                             : total:m344_20{flow:def->@, @->result}, partial:m344_33{flow:def->@, @->result}
#  344|         flow = total->@, partial->@, +m344_33->@, @->+r347_3, @->v347_7.side_effect, @->m347_9.total, @->m344_20.1
```
The `+` annotations indicate when the flow came from `isAdditionalFlowStep()`, rather than built-in local flow.
PrintAST now works on `library-tests/dataflow/taint-tests`.
@github-actions github-actions bot added the C# label Oct 17, 2020
cached
private predicate instructionTaintStep(Instruction i1, Instruction i2) {
private predicate commonTaintStep(DataFlow::Node fromNode, DataFlow::Node toNode) {
instructionToInstructionTaintStep(fromNode.asInstruction(), toNode.asInstruction())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with how we interleave instruction -> operand steps in dataflow we should probably convert this to pure instruction -> operand and operand -> instruction flow at some point once this PR has been merged. But I'm fine with these changes as they are now!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just didn't want to change more than I already had to.

@dbartol
Copy link
Author

dbartol commented Nov 2, 2020

Very few diffs. The implicit-bitfield-downcast is a side effect of fixing a "one Expr with multiple immediate Conversions" issue. The correct fix is suggested in https://github.com/github/codeql-c-analysis-team/issues/180.

@dbartol
Copy link
Author

dbartol commented Nov 2, 2020

JDK diff looks like a true positive. A call to a pure constructor wasn't treated as being a PureExprInVoidContext because it had a void return type. With the addition of the temporary object into which the object is constructed, the return type is now the type of the class, so we'll treat it the same as a call to a pure function that returns that class.

@dbartol
Copy link
Author

dbartol commented Nov 2, 2020

php-src diff did not repo locally.

@dbartol
Copy link
Author

dbartol commented Nov 2, 2020

Diffs look good. Perf is nearly unchanged. This is ready for final sign-off.

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just reviewed the code changes as best I can and they look good to me. I remember checking the taint test result changes from this PR quite thoroughly before and was very happy with them as well.

A comment by @jbj appears to have gone missing from the front page of the PR. Apart from that, it appears all concerns have been addressed.

I'm keen to merge this today. We've certainly all had plenty of opportunity to review at this point. I've started SAMATE runs (https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp/548/ and https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp2/31/) on this branch, mostly for my own information - they will take ~18 hours to complete so lets not wait for them.

@jbj
Copy link
Contributor

jbj commented Nov 3, 2020

Please don't merge this PR until the rc/1.26 branch has been created. At the time of writing this, it's waiting for a submodule bump to finish testing.

@geoffw0
Copy link
Contributor

geoffw0 commented Nov 3, 2020

Please don't merge this PR until the rc/1.26 branch has been created.

I foolishly assumed that had happened yesterday as planned. Is there somewhere I can check for sure, or can you post here when it's done?

@jbj
Copy link
Contributor

jbj commented Nov 3, 2020

I foolishly assumed that had happened yesterday as planned. Is there somewhere I can check for sure, or can you post here when it's done?

It's done, and we can move forward. But this branch can only be merged together with its internal PR, and the internal PR has both a test failure and a submodule conflict.

@dbartol
Copy link
Author

dbartol commented Nov 3, 2020

@jbj

You've added a lot of code to the IR generation to bridge between how the IR likes to see the AST and how the AST looks in the database, especially with the last commit about member accesses on prvalues. Would it be an advantage to move any of that code to the layer that's in between the IR and the database: the QL AST classes?

The fixups that I'm doing involve introducing new temporary objects where there were none before, so I'm not sure how I'd do that in the AST layer without drastically changing how we implement the AST. I've considered fixing this all up in the extractor, but if it's primarily for the IR's benefit, I think I'd prefer to add more QL to the IR construction than more C++ to the extractor.

@igfoo igfoo merged commit 6ff939d into github:main Nov 4, 2020
jbj added a commit to jbj/ql that referenced this pull request Nov 9, 2020
A performance regression in `definitionByReferenceNodeFromArgument#ff`
was ultimately caused by a join on parameter indexes in
`DefinitionByReferenceNode.getArgument`. Joining on numbers in QL is
always fragile, and somehow the changes in github#4432 had caused the join
order here to break.

Instead of tweaking the join order in the slow predicate itself, I added
`pragma[noinline]` to one of the predicates involved in the join on
parameter indexes. This should prevent us from getting similar
performance problems in the future when we write code that joins on
parameter numbers. Joining on indexes is always risky, but it's even
more risky when one of the predicates in the join is inlined by the
compiler and expands to further joins.

I tested performance by running `CgiXss.ql` on a ChakraCore snapshot.
Tuple counts before (I interrupted execution after five minutes or so):

    (626s) Tuple counts for DataFlowUtil::definitionByReferenceNodeFromArgument#ff:
    58162      ~0%     {3} r1 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, -1, I.<0>
    26934      ~0%     {2} r2 = JOIN r1 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 2 OUTPUT r1.<0>, r1.<2>
    26934      ~1%     {2} r3 = JOIN r2 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>
    26850      ~1%     {2} r4 = JOIN r3 WITH Instruction::CallInstruction::getThisArgumentOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<1>
    26850      ~0%     {2} r5 = JOIN r4 WITH Operand::Operand::getDef_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r4.<1>
    26850      ~1%     {2} r6 = JOIN r5 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r5.<1>
    58162      ~0%     {2} r7 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, I.<0>
    58162      ~4%     {3} r8 = JOIN r7 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 1 OUTPUT R.<1>, r7.<1>, r7.<0>
    4026581120 ~0%     {4} r9 = JOIN r8 WITH Instruction::CallInstruction::getPositionalArgumentOperand_dispred#fff_102#join_rhs AS R ON FIRST 1 OUTPUT r8.<2>, R.<1>, r8.<1>, R.<2>
    31154      ~4%     {2} r10 = JOIN r9 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 2 OUTPUT r9.<3>, r9.<2>
    31154      ~8%     {2} r11 = JOIN r10 WITH Operand::Operand::getDef_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r10.<1>
    31154      ~0%     {2} r12 = JOIN r11 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r11.<1>
    58004      ~0%     {2} r13 = r6 \/ r12
                       return r13

Tuple counts after:

    (0s) Tuple counts for DataFlowUtil::definitionByReferenceNodeFromArgument#ff:
    385785  ~6%     {2} r1 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, I.<0>
    385785  ~0%     {3} r2 = JOIN r1 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 1 OUTPUT r1.<0>, r1.<1>, R.<1>
    385785  ~1%     {3} r3 = JOIN r2 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r2.<2>, r2.<1>
    198736  ~4%     {2} r4 = JOIN r3 WITH Instruction::CallInstruction::getPositionalArgument#fff AS R ON FIRST 2 OUTPUT R.<2>, r3.<2>
    198736  ~0%     {2} r5 = JOIN r4 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r4.<1>
    385785  ~1%     {3} r6 = SCAN DataFlowUtil::DefinitionByReferenceNode#class#ff AS I OUTPUT I.<1>, -1, I.<0>
    186891  ~1%     {2} r7 = JOIN r6 WITH Instruction::IndexedInstruction#ff AS R ON FIRST 2 OUTPUT r6.<0>, r6.<2>
    186891  ~2%     {2} r8 = JOIN r7 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r7.<1>
    183201  ~3%     {2} r9 = JOIN r8 WITH Instruction::CallInstruction::getThisArgumentOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r8.<1>
    183201  ~0%     {2} r10 = JOIN r9 WITH Operand::Operand::getDef_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r9.<1>
    175449  ~8%     {2} r11 = JOIN r10 WITH Instruction::Instruction::getUnconvertedResultExpression_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r10.<1>
    374185  ~3%     {2} r12 = r5 \/ r11
                    return r12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C# C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants