RFC - framework data-flow packs #4443

hvitved · 2020-10-09T08:47:16Z

DO NOT MERGE: This a RFC PR.

This PR attempts to introduce a means of specifying framework data-flow packs, which for now means flow-summaries, sources, sinks (and content-clearing callables). The hope is that we can agree on similar interfaces for languages that use the shared data-flow library (and perhaps at some point shared implementation of the machinery that interprets the framework models).

The following entities will be language-specific (but there will likely be much overlap):

SummaryInput: The type of inputs that can be specified in flow summaries.
SummaryOutput: The type of outputs that can be specified in flow summaries.
FlowSource: The various kinds of flow sources.
FlowSink: The various kinds of flow sinks.

The reason for defining FrameworkDataFlow as an abstract class that extends string, as opposed to a unit type, is that we can use it to provide some numbers on how much coverage we have for a given framework, e.g.

int countCallables(FrameworkDataFlow framework) {
  result =
    count(Callable c | framework.hasSummary(c, _, _, _, _, _) or framework.clearsContent(c, _, _))
}

jbj · 2020-10-09T09:24:17Z

The C/C++ side of this problem has been discussed so far in https://github.com/github/codeql-c-analysis-team/issues/153

calumgrant

Interesting! I love the idea of a more consistent and uniform way to express library flow for all the languages, and I prefer the idea of grouping them by framework than by role.

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

calumgrant · 2020-10-09T11:58:57Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+
+  /** Data flow for `System.Text.StringBuilder`. */
+  class StringBuilderDataFlow extends FrameworkDataFlow {
+    StringBuilderDataFlow() { this = "System.Text.StringBuilder" }


Is this string useful in any way? Would you ever need to limit data flow to just class StringBuilderDataFlow and "System.Text.StringBuilder" Because if it's never used, then it just becomes a predicate parameter we are carrying around for no good reason, and we can avoid the configuration-based approach, and instead write

// SummaryFlow is abstract and extends SourceDeclarationCallable class SystemTextBuilderFlow extends SummaryFlow { override predicate hasFlow(SummaryInput input, ContentStack inputAp, SummaryOutput output, ContentStack outputAp) { this = any(SystemTextStringBuilderClass c).getAMember() and ... } }

I'm merely posting this as a strawman.

Isn't this the answer? (From the PR description)

The reason for defining FrameworkDataFlow as an abstract class that extends string, as opposed to a unit type, is that we can use it to provide some numbers on how much coverage we have for a given framework, e.g.

int countCallables(FrameworkDataFlow framework) { result = count(Callable c | framework.hasSummary(c, _, _, _, _, _) or framework.clearsContent(c, _, _)) }

I think it can be useful, for example if we want to have numbers on how much coverage we have for a given framework (see PR description for an example). Once the summaries have been compiled down to atomic steps we will no longer carry around the string column.

Would you call "System.Text.StringBuilder" a framework? I don't know what's traditional for C#, but I'd think of the whole C# standard library as one framework. For finer granularity you might call "System.Text" a framework, but I don't see who'd be interested in the information at class-level granularity.

In any case, you can always recover class-level granularity without these manually-added text strings: find all flow through callables and group by the enclosing class of those callables.

I wouldn't call stringbuilder a framework, so methodology-wise it was a bad example, I merely included it to show examples of flow summaries.

I'm not convinced that the framework string is going to be useful. A class or method can only be declared as belonging to a framework when it carries data flow. There's no way to state that a class is part of some framework but has no data flow. If that count(...) example tells us that a certain snapshot has 20 classes from the ASP.NET framework modelled, is that a lot or a little? It would depend on the number of classes from ASP.NET that are not modelled.

If we can write one line of QL centrally to declare that all classes under the System.Web hierarchy belong to the ASP.NET framework, then we'll have described both the classes that have data-flow models and the classes that don't. That seems both more general and more brief than repeating the ASP.NET name in every data-flow model of a System.Web class.

We used to have the Tech Inventory feature to figure out which elements come from which frameworks. I'm sure that feature can still be brought to life if we want.

I agree that the string may not be relevant for flow summaries, but for sources and sinks I don't think we can necessarily map them easily back to their framework.

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

intrigus-lgtm · 2020-10-09T12:38:11Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+  predicate hasSink(DataFlow::Node n, FlowSink sink) { none() }
+}
+
+module Examples {


I would very much prefer a full example.
I.e. given some code in C#/Java/..., how does one create an appropriate extension of the FrameworkDataFlow class.
As an external contributor I find all this SummaryInput, ContentStack, SummaryOutput stuff far more complex than what I've done before (defining new data flow sinks, etc..).

What is missing from the examples that I have already included? It shows to to define data-flow models for StringBuilders, Lazy<T>s, and IDbCommand. Note that what we are modelling is the case when we are using a library that we don't have the source code for.

Well, I don't really understand what a SummaryInput or ContentStack is/does.

The QlDoc for both classes does not help someone that does not already understand everything.
E.g. what even is a "flow-summary"?
For example:

private predicate constructorFlow( Constructor c, SummaryInput input, ContentList inputAp, SummaryOutput output, ContentList outputAp ) { c = this.getClass().getAMember() and c.getParameter(0).getType() instanceof StringType and input = TArgumentSummaryInput(0) and inputAp = ContentList::empty() and output = TReturnSummaryOutput() and outputAp = ContentList::element() }

Does this mean that there is flow from an input that is the 0-th argument, to the return value of the constructor?
How does this relate to inputAp and outputAp?

I was hoping that the QL doc for hasSummary was sufficient:

/** * Holds if data may flow from `input` to `output` when calling `c`. * * `inputContents` describes the contents that is popped from the access * path from the input and `outputContents` describes the contents that * is pushed onto the resulting access path. * * `preservesValue` indicates whether this is a value-preserving step * or a taint-step. */

We can't really give examples of what input and output might be, as those are language-specific.

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

jbj

I'll have deeper feedback after talking to the team this afternoon.

jbj · 2020-10-12T13:54:27Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+
+  /** Data flow for `System.Text.StringBuilder`. */
+  class StringBuilderDataFlow extends FrameworkDataFlow {
+    StringBuilderDataFlow() { this = "System.Text.StringBuilder" }


Would you call "System.Text.StringBuilder" a framework? I don't know what's traditional for C#, but I'd think of the whole C# standard library as one framework. For finer granularity you might call "System.Text" a framework, but I don't see who'd be interested in the information at class-level granularity.

In any case, you can always recover class-level granularity without these manually-added text strings: find all flow through callables and group by the enclosing class of those callables.

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

csharp/ql/src/semmle/code/csharp/dataflow/internal/FrameworkDataFlowImpl.qll

jbj · 2020-10-12T14:39:54Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+
+  private class RemoteFlowSink extends Range {
+    RemoteFlowSink() { this = "remote" }
+  }


What's the correspondence between FlowSink::Range and the query suite? Today we roughly have one query per sink type, right? One for HTML, one for SQL, and so on. That's because each sink type needs a separate qhelp file, and each needs its own set of sanitizers.

Is FlowSink::Range really a "flow label"? I'd think it should go together with a group of sanitizers, which is also why I'd think these classes should be public. But maybe I'm asking these questions because I don't understand what these classes will be used for.

It is true that many of the FlowSinks will be one-to-one with queries, but not all (for example the remote flow sink is useful for all exposure queries). The reason that RemoteFlowSink is not public is that it uses the underlying string representation; the predicate remote() on the other hand is public. So Range is only meant to be used when defining new kinds of flow sources/sinks.

The idea is that we can use these sources/sinks to redefine for example RemoteFlowSource:

class RemoteFlowSource extends Node { RemoteFlowSource() { any(FrameworkDataFlow fdf).hasSink(this, FlowSink::remote()) } }

Perhaps RemoteFlowSink should be called RemoteFlowSinkRange. The only reason I use the range-setup is to reduce string pitfalls; one can, of course, still construct an invalid string, but it is much more cumbersome (first you need to extend Range and then call toFlowSink() to get a value that wraps the invalid string).

I also forgot to apply isSubsetOf above, so it should be

class RemoteFlowSink extends Node { RemoteFlowSink() { any(FrameworkDataFlow fdf).hasSink(this, any(FlowSink fs | fs.isSubsetOf*(FlowSink::remote())) } }

jbj · 2020-10-12T14:41:00Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+  private class HtmlFlowSink extends Range {
+    HtmlFlowSink() { this = "html" }
+
+    override predicate isSubsetOf(FlowSink fs) { fs = remote() }


Subsetting of classes usually goes via QL's class hierarchy mechanism. Is it necessary to build a custom system?

I couldn't come up with a good way to do it. For example, I want to have flow sinks Email, Html, and Remote, where Email and Html are also Remote flow sinks, but not vice versa. Making Remote an abstract class does not work, because we want to be able to model remote flow sinks that are neither Email nor Html.

I don't understand what you're saying from "not vice versa" and on. If we want a remote flow sink that's not Email or Html then we extend the abstract class Remote, and we have it.

That would probably mean adding very specific flow sinks, or perhaps just a NonOfTheOtherRemoteFlowSink, but perhaps that is ok.

jbj

Here are some more comments I have after discussing with @github/codeql-c-analysis yesterday.

csharp/ql/src/semmle/code/csharp/dataflow/internal/FrameworkDataFlowImpl.qll

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

jbj · 2020-10-13T10:09:54Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+
+  /** Data flow for `System.Text.StringBuilder`. */
+  class StringBuilderDataFlow extends FrameworkDataFlow {
+    StringBuilderDataFlow() { this = "System.Text.StringBuilder" }


I'm not convinced that the framework string is going to be useful. A class or method can only be declared as belonging to a framework when it carries data flow. There's no way to state that a class is part of some framework but has no data flow. If that count(...) example tells us that a certain snapshot has 20 classes from the ASP.NET framework modelled, is that a lot or a little? It would depend on the number of classes from ASP.NET that are not modelled.

If we can write one line of QL centrally to declare that all classes under the System.Web hierarchy belong to the ASP.NET framework, then we'll have described both the classes that have data-flow models and the classes that don't. That seems both more general and more brief than repeating the ASP.NET name in every data-flow model of a System.Web class.

We used to have the Tech Inventory feature to figure out which elements come from which frameworks. I'm sure that feature can still be brought to life if we want.

jbj · 2020-10-13T10:15:15Z

csharp/ql/src/semmle/code/csharp/dataflow/FrameworkDataFlow.qll

+  private import semmle.code.csharp.frameworks.system.Text
+
+  /** Data flow for `System.Text.StringBuilder`. */
+  class StringBuilderDataFlow extends FrameworkDataFlow {


In C/C++ modelling, we extend Function rather than a string or unit type when we want to model a Function. On that extension we can model all aspects of the function: data flow, taint, aliasing behaviour, escaping parameters, exception safety, etc. We're not sure whether this approach of having many QL classes for one C++ class is better or worse. It seems worse in that there's more class-declaration boilerplate to write when modelling just one function on a class, but it seems better in that multiple functions can be modelled in a single override of hasSummary, and this might overall save on boilerplate.

We're not sure whether this approach of having many QL classes for one C++ class is better or worse

Would it help if you simply extended the FrameworkDataFlow class with the extra features that you need for C++? I.e. instead of

class FrameworkDataFlow = Impl::FrameworkDataFlow;

you would have

abstract class FrameworkDataFlow extends Impl::FrameworkDataFlow { predicate neverEscapes(Function f, SummaryInput input) { none() } // ... }

I don't think that's practical. It would require effectively copying all members from all model types into FrameworkDataFlow. But why should all models be redirected through the data-flow model? I think such a setup only makes sense when data flow is the only model you have, or at least when everything else is secondary.

jbj · 2020-10-13T15:00:10Z

csharp/ql/src/semmle/code/csharp/dataflow/internal/FrameworkDataFlowImpl.qll

+  /** Holds if `n` is a flow source of type `source. */
+  pragma[nomagic]
+  predicate hasSource(Node n, FlowSource source) { none() }
+
+  /** Holds if `n` is a flow sink of type `sink. */
+  pragma[nomagic]
+  predicate hasSink(Node n, FlowSink sink) { none() }


Why are sources and sinks specified in terms of Node when summaries are specified in terms of callables, inputs, and outputs? If the input/output specification isn't good enough four sources and sinks, then how can it be good enough for additional flow steps?

In C/C++ we avoid Node in models. That way our models aren't coupled to the data-flow library, which is both architecturally clean and also a necessity as long as we have two data-flow libraries.

Those are some very valid points, perhaps something like this would work instead?

/** Holds if output of type `output` from `c` is a flow source of type `source`. */ predicate hasSource(FrameworkCallable c, SummaryOutput output, FlowSource source) { none() }

and similar for hasSink. That, however, restricts sources and sinks to output from/input to callables, but perhaps that will cover most cases.

…languages

github-actions bot added the C# label Oct 9, 2020

hvitved added C++ Java Python labels Oct 9, 2020

calumgrant reviewed Oct 9, 2020

View reviewed changes

intrigus-lgtm reviewed Oct 9, 2020

View reviewed changes

smowton reviewed Oct 9, 2020

View reviewed changes

jbj reviewed Oct 12, 2020

View reviewed changes

jbj reviewed Oct 13, 2020

View reviewed changes

hvitved added 5 commits October 15, 2020 14:44

RFC - framework data-flow packs

95f85c3

Restructure framework data-flow packs to make them reusable by other …

8825098

…languages

Add hasSummary predicate without input/output contents

e57aaa6

Redefine hasSource and hasSink

9c543cd

C#-specific changes

32d8bb9

hvitved force-pushed the csharp/dataflow/frameworks branch from 399527f to 9117e32 Compare October 15, 2020 18:47

Shared flow-summary compilation

22a0714

hvitved force-pushed the csharp/dataflow/frameworks branch from 9117e32 to 22a0714 Compare October 16, 2020 06:57

Remove isSubsetOf from FlowSource/FlowSink

b34f5aa

hvitved mentioned this pull request Oct 16, 2020

C#: Shared interface/implementation for flow summaries #4495

Merged

hvitved closed this Oct 16, 2020

hvitved deleted the csharp/dataflow/frameworks branch October 16, 2020 14:02

hvitved restored the csharp/dataflow/frameworks branch October 16, 2020 14:02

Dec	JAN	Feb
	07
2025	2026	2027

RFC - framework data-flow packs #4443

RFC - framework data-flow packs #4443

Uh oh!

Conversation

hvitved commented Oct 9, 2020

Uh oh!

jbj commented Oct 9, 2020

Uh oh!

calumgrant left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jbj left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hvitved Oct 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jbj left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

hvitved Oct 13, 2020 •

edited

Loading