-
Notifications
You must be signed in to change notification settings - Fork 1.9k
RFC - framework data-flow packs #4443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The C/C++ side of this problem has been discussed so far in https://github.com/github/codeql-c-analysis-team/issues/153 |
calumgrant
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! I love the idea of a more consistent and uniform way to express library flow for all the languages, and I prefer the idea of grouping them by framework than by role.
csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll
Outdated
Show resolved
Hide resolved
|
|
||
| /** Data flow for `System.Text.StringBuilder`. */ | ||
| class StringBuilderDataFlow extends FrameworkDataFlow { | ||
| StringBuilderDataFlow() { this = "System.Text.StringBuilder" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this string useful in any way? Would you ever need to limit data flow to just class StringBuilderDataFlow and "System.Text.StringBuilder" Because if it's never used, then it just becomes a predicate parameter we are carrying around for no good reason, and we can avoid the configuration-based approach, and instead write
// SummaryFlow is abstract and extends SourceDeclarationCallable
class SystemTextBuilderFlow extends SummaryFlow
{
override predicate hasFlow(SummaryInput input, ContentStack inputAp, SummaryOutput output,
ContentStack outputAp) {
this = any(SystemTextStringBuilderClass c).getAMember() and
...
}
}
I'm merely posting this as a strawman.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the answer? (From the PR description)
The reason for defining
FrameworkDataFlowas an abstract class that extendsstring, as opposed to a unit type, is that we can use it to provide some numbers on how much coverage we have for a given framework, e.g.int countCallables(FrameworkDataFlow framework) { result = count(Callable c | framework.hasSummary(c, _, _, _, _, _) or framework.clearsContent(c, _, _)) }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be useful, for example if we want to have numbers on how much coverage we have for a given framework (see PR description for an example). Once the summaries have been compiled down to atomic steps we will no longer carry around the string column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you call "System.Text.StringBuilder" a framework? I don't know what's traditional for C#, but I'd think of the whole C# standard library as one framework. For finer granularity you might call "System.Text" a framework, but I don't see who'd be interested in the information at class-level granularity.
In any case, you can always recover class-level granularity without these manually-added text strings: find all flow through callables and group by the enclosing class of those callables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't call stringbuilder a framework, so methodology-wise it was a bad example, I merely included it to show examples of flow summaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that the framework string is going to be useful. A class or method can only be declared as belonging to a framework when it carries data flow. There's no way to state that a class is part of some framework but has no data flow. If that count(...) example tells us that a certain snapshot has 20 classes from the ASP.NET framework modelled, is that a lot or a little? It would depend on the number of classes from ASP.NET that are not modelled.
If we can write one line of QL centrally to declare that all classes under the System.Web hierarchy belong to the ASP.NET framework, then we'll have described both the classes that have data-flow models and the classes that don't. That seems both more general and more brief than repeating the ASP.NET name in every data-flow model of a System.Web class.
We used to have the Tech Inventory feature to figure out which elements come from which frameworks. I'm sure that feature can still be brought to life if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the string may not be relevant for flow summaries, but for sources and sinks I don't think we can necessarily map them easily back to their framework.
| predicate hasSink(DataFlow::Node n, FlowSink sink) { none() } | ||
| } | ||
|
|
||
| module Examples { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would very much prefer a full example.
I.e. given some code in C#/Java/..., how does one create an appropriate extension of the FrameworkDataFlow class.
As an external contributor I find all this SummaryInput, ContentStack, SummaryOutput stuff far more complex than what I've done before (defining new data flow sinks, etc..).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is missing from the examples that I have already included? It shows to to define data-flow models for StringBuilders, Lazy<T>s, and IDbCommand. Note that what we are modelling is the case when we are using a library that we don't have the source code for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I don't really understand what a SummaryInput or ContentStack is/does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QlDoc for both classes does not help someone that does not already understand everything.
E.g. what even is a "flow-summary"?
For example:
private predicate constructorFlow(
Constructor c, SummaryInput input, ContentList inputAp, SummaryOutput output,
ContentList outputAp
) {
c = this.getClass().getAMember() and
c.getParameter(0).getType() instanceof StringType and
input = TArgumentSummaryInput(0) and
inputAp = ContentList::empty() and
output = TReturnSummaryOutput() and
outputAp = ContentList::element()
}Does this mean that there is flow from an input that is the 0-th argument, to the return value of the constructor?
How does this relate to inputAp and outputAp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping that the QL doc for hasSummary was sufficient:
/**
* Holds if data may flow from `input` to `output` when calling `c`.
*
* `inputContents` describes the contents that is popped from the access
* path from the input and `outputContents` describes the contents that
* is pushed onto the resulting access path.
*
* `preservesValue` indicates whether this is a value-preserving step
* or a taint-step.
*/We can't really give examples of what input and output might be, as those are language-specific.
csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll
Outdated
Show resolved
Hide resolved
csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll
Outdated
Show resolved
Hide resolved
csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll
Outdated
Show resolved
Hide resolved
jbj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have deeper feedback after talking to the team this afternoon.
|
|
||
| /** Data flow for `System.Text.StringBuilder`. */ | ||
| class StringBuilderDataFlow extends FrameworkDataFlow { | ||
| StringBuilderDataFlow() { this = "System.Text.StringBuilder" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you call "System.Text.StringBuilder" a framework? I don't know what's traditional for C#, but I'd think of the whole C# standard library as one framework. For finer granularity you might call "System.Text" a framework, but I don't see who'd be interested in the information at class-level granularity.
In any case, you can always recover class-level granularity without these manually-added text strings: find all flow through callables and group by the enclosing class of those callables.
csharp/ql/src/semmle/code/csharp/dataflow/internal/FrameworkDataFlowImpl.qll
Show resolved
Hide resolved
|
|
||
| private class RemoteFlowSink extends Range { | ||
| RemoteFlowSink() { this = "remote" } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the correspondence between FlowSink::Range and the query suite? Today we roughly have one query per sink type, right? One for HTML, one for SQL, and so on. That's because each sink type needs a separate qhelp file, and each needs its own set of sanitizers.
Is FlowSink::Range really a "flow label"? I'd think it should go together with a group of sanitizers, which is also why I'd think these classes should be public. But maybe I'm asking these questions because I don't understand what these classes will be used for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that many of the FlowSinks will be one-to-one with queries, but not all (for example the remote flow sink is useful for all exposure queries). The reason that RemoteFlowSink is not public is that it uses the underlying string representation; the predicate remote() on the other hand is public. So Range is only meant to be used when defining new kinds of flow sources/sinks.
The idea is that we can use these sources/sinks to redefine for example RemoteFlowSource:
class RemoteFlowSource extends Node {
RemoteFlowSource() { any(FrameworkDataFlow fdf).hasSink(this, FlowSink::remote()) }
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps RemoteFlowSink should be called RemoteFlowSinkRange. The only reason I use the range-setup is to reduce string pitfalls; one can, of course, still construct an invalid string, but it is much more cumbersome (first you need to extend Range and then call toFlowSink() to get a value that wraps the invalid string).
I also forgot to apply isSubsetOf above, so it should be
class RemoteFlowSink extends Node {
RemoteFlowSink() {
any(FrameworkDataFlow fdf).hasSink(this, any(FlowSink fs | fs.isSubsetOf*(FlowSink::remote()))
}
}| private class HtmlFlowSink extends Range { | ||
| HtmlFlowSink() { this = "html" } | ||
|
|
||
| override predicate isSubsetOf(FlowSink fs) { fs = remote() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subsetting of classes usually goes via QL's class hierarchy mechanism. Is it necessary to build a custom system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't come up with a good way to do it. For example, I want to have flow sinks Email, Html, and Remote, where Email and Html are also Remote flow sinks, but not vice versa. Making Remote an abstract class does not work, because we want to be able to model remote flow sinks that are neither Email nor Html.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you're saying from "not vice versa" and on. If we want a remote flow sink that's not Email or Html then we extend the abstract class Remote, and we have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would probably mean adding very specific flow sinks, or perhaps just a NonOfTheOtherRemoteFlowSink, but perhaps that is ok.
jbj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some more comments I have after discussing with @github/codeql-c-analysis yesterday.
csharp/ql/src/semmle/code/csharp/dataflow/internal/FrameworkDataFlowImpl.qll
Show resolved
Hide resolved
csharp/ql/src/semmle/code/csharp/dataflow/internal/FrameworkDataFlowImpl.qll
Show resolved
Hide resolved
|
|
||
| /** Data flow for `System.Text.StringBuilder`. */ | ||
| class StringBuilderDataFlow extends FrameworkDataFlow { | ||
| StringBuilderDataFlow() { this = "System.Text.StringBuilder" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that the framework string is going to be useful. A class or method can only be declared as belonging to a framework when it carries data flow. There's no way to state that a class is part of some framework but has no data flow. If that count(...) example tells us that a certain snapshot has 20 classes from the ASP.NET framework modelled, is that a lot or a little? It would depend on the number of classes from ASP.NET that are not modelled.
If we can write one line of QL centrally to declare that all classes under the System.Web hierarchy belong to the ASP.NET framework, then we'll have described both the classes that have data-flow models and the classes that don't. That seems both more general and more brief than repeating the ASP.NET name in every data-flow model of a System.Web class.
We used to have the Tech Inventory feature to figure out which elements come from which frameworks. I'm sure that feature can still be brought to life if we want.
| private import semmle.code.csharp.frameworks.system.Text | ||
|
|
||
| /** Data flow for `System.Text.StringBuilder`. */ | ||
| class StringBuilderDataFlow extends FrameworkDataFlow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C/C++ modelling, we extend Function rather than a string or unit type when we want to model a Function. On that extension we can model all aspects of the function: data flow, taint, aliasing behaviour, escaping parameters, exception safety, etc. We're not sure whether this approach of having many QL classes for one C++ class is better or worse. It seems worse in that there's more class-declaration boilerplate to write when modelling just one function on a class, but it seems better in that multiple functions can be modelled in a single override of hasSummary, and this might overall save on boilerplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not sure whether this approach of having many QL classes for one C++ class is better or worse
Would it help if you simply extended the FrameworkDataFlow class with the extra features that you need for C++? I.e. instead of
class FrameworkDataFlow = Impl::FrameworkDataFlow;you would have
abstract class FrameworkDataFlow extends Impl::FrameworkDataFlow {
predicate neverEscapes(Function f, SummaryInput input) { none() }
// ...
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's practical. It would require effectively copying all members from all model types into FrameworkDataFlow. But why should all models be redirected through the data-flow model? I think such a setup only makes sense when data flow is the only model you have, or at least when everything else is secondary.
| /** Holds if `n` is a flow source of type `source. */ | ||
| pragma[nomagic] | ||
| predicate hasSource(Node n, FlowSource source) { none() } | ||
|
|
||
| /** Holds if `n` is a flow sink of type `sink. */ | ||
| pragma[nomagic] | ||
| predicate hasSink(Node n, FlowSink sink) { none() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are sources and sinks specified in terms of Node when summaries are specified in terms of callables, inputs, and outputs? If the input/output specification isn't good enough four sources and sinks, then how can it be good enough for additional flow steps?
In C/C++ we avoid Node in models. That way our models aren't coupled to the data-flow library, which is both architecturally clean and also a necessity as long as we have two data-flow libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are some very valid points, perhaps something like this would work instead?
/** Holds if output of type `output` from `c` is a flow source of type `source`. */
predicate hasSource(FrameworkCallable c, SummaryOutput output, FlowSource source) { none() }and similar for hasSink. That, however, restricts sources and sinks to output from/input to callables, but perhaps that will cover most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
399527f to
9117e32
Compare
9117e32 to
22a0714
Compare


DO NOT MERGE: This a RFC PR.
This PR attempts to introduce a means of specifying framework data-flow packs, which for now means flow-summaries, sources, sinks (and content-clearing callables). The hope is that we can agree on similar interfaces for languages that use the shared data-flow library (and perhaps at some point shared implementation of the machinery that interprets the framework models).
The following entities will be language-specific (but there will likely be much overlap):
SummaryInput: The type of inputs that can be specified in flow summaries.SummaryOutput: The type of outputs that can be specified in flow summaries.FlowSource: The various kinds of flow sources.FlowSink: The various kinds of flow sinks.The reason for defining
FrameworkDataFlowas an abstract class that extendsstring, as opposed to a unit type, is that we can use it to provide some numbers on how much coverage we have for a given framework, e.g.