Queries

Table of Contents

  1. Queries (Overview)
  2. FROM Clause
  3. MATCH Expression
  4. WITH Clause
  5. LET Clause
  6. WHERE Clause
  7. GROUP BY Clause
  8. HAVING Clause

Queries (Overview)

A query can either be an expression (whose composition remains unchanged from SQL++), or a construction of query blocks. A query block may contain several clauses, including SELECT, FROM, LET, WHERE, GROUP BY, and HAVING. The following productions are also unchanged from SQL++.


Query

Query Production Diagram


Selection

Selection Production Diagram


Query Block

Query Block Diagram


Stream Generator

Stream Generator Diagram


Similar to SQL++ (but unlike SQL), gSQL++ allows the SELECT clause to appear either at the beginning or the end of a query black. For some queries, placing the SELECT clause at the end may make a query block easier to understand because the SELECT clause refers to variables defined in the stream generator production.

FROM Clause

The purpose of a FROM clause is to iterate over a collection, binding a variable to each item in turn. Graphix (and gSQL++) do not change the purpose of the FROM clause, rather the gSQL++ extension gives users an additional method to specify variable bindings (in the form of vertices, edges, and paths). As shown in the FromTerm production, gSQL++ users are free to interleave their graph bindings with existing SQL++ document bindings to express truly synergistic document-based and graph-based queries.


FROM Clause

FROM Clause Diagram


From Term

From Term Diagram


Qualified Name

Qualified Name Diagram


Named Expression (NamedExpr)

Named Expression Diagram


JOIN Step

JOIN Step Diagram


UNNEST Step

UNNEST Step Diagram


MATCH Step

MATCH Step Diagram


Below is a query that illustrates the usage of the gSQL++ extension to iterate over mappings of a single-edge pattern to our managed graph GelpGraph.

FROM    
    GRAPH GelpGraph
        (u1:User)-[:FRIENDS_WITH]->(u2:User)
SELECT  
    u1, 
    u2;

Putting aside the good practice of specifying explicit iteration variables in SQL++, the problem of specifying graph query patterns in nearly all non-trivial use cases involves describing more than one graph element. Consequently, gSQL++ does not support implicit iteration variables for variables defined in MatchExpr and MatchStep productions. This lack of support contrasts SQL++, where one-dataset queries are (arguably) more common.

MATCH Expression

The purpose of a PatternExpr is to specify a (potentially navigational) graph pattern and introduce all mapping permutations of the graph pattern to the underlying data.


MATCH Expression (MatchExpr)

MATCH Expression Diagram


Pattern Expression (PatternExpr)

Pattern Expression Diagram


Vertex Pattern

Vertex Pattern Diagram


Vertex Detail

Vertex Detail Diagram


Edge Pattern

Edge Pattern Diagram


Edge Detail

Edge Detail Diagram


Path Pattern

Path Pattern Diagram


Path Detail

Path Detail Diagram


Label Set

Label Set Diagram


Repetition Quantifier

Repetition Quantifier Diagram


The following query illustrates a basic graph pattern (BGP) matching query, which finds all users that have written a review.

FROM    
    GRAPH GelpGraph
        (r:Review)<-[mb:MADE_BY]-(u:User)
SELECT DISTINCT 
    u;

In the example above, two vertices and one edge are specified in the graph pattern. An edge must always connect exactly two vertices, but a vertex can be specified without an edge.

Graph patterns are implicitly joined with one another if they share a vertex variable. Such style may be useful for improving the readability of your query. The example below shows two graph patterns: the first of which finds all users that have written a review, and the second of which finds all users and their friends.

FROM    
    GRAPH GelpGraph
        (r:Review)<-[mb:MADE_BY]-(u1:User),
        (u1)-[:FRIENDS_WITH]->(u2:User)
SELECT DISTINCT 
    u1;

Conceptually, all permutations of both patterns are then joined on their common vertices. If a pattern does not share any vertices with any other pattern in that specific FromTerm, then we say that our pattern is disjoint. Disjoint patterns are analogous to cartesian products in SQL: the binding tuple stream after the FromTerm contains all possible pairs of the disjoint patterns.

Vertices and edges are the core of all graph queries, but often we may want to reason about our graph at the level of paths. Paths are a collection of edges, and can be specified in gSQL++ in a similar manner to an edge pattern. The difference with paths, however, is the use of a repetition quantifier after the variable and label specification. Paths are described in Graphix using a regular expression of edge labels. In the query below, we assign a variable of p to a path consisting of only 1 to 3 FRIENDS_WITH edges.

FROM    
    GRAPH GelpGraph
        (u1:User)<-[p:FRIENDS_WITH{1,3}]-(u2:User)
SELECT VALUE  
    PATH_VERTICES(p);

Similar to SQL, gSQL++ offers optional binding to patterns with the LEFT MATCH clause. The following example asks for users and their friends, as well as reviews if they have any.

FROM       
    GRAPH GelpGraph
        (u1:User)-[:FRIENDS_WITH]->(u2:User)
    LEFT MATCH 
        (u1)<-[mb:MADE_BY]-(r:Review)
SELECT     
    u1, 
    u2, 
    mb, 
    r;

gSQL++ will only bind to variables declared in a LEFT MATCH clause if (and only if) the corresponding pattern can be matched in full. In the example above, if there was a mb edge record that was connected to u1 but not to any existing user, then MISSING would be bound to both mb and r for that instance. The flexibility of AsterixDB’s data model means that edges in Graphix may not be “consistent”. gSQL++ will always work in units of patterns, not individual collections.

Features such as edge label alternation and negation are planned, but not implemented yet.

WITH Clause

Similar to SQL and SQL++, WITH clauses are used to improve the modularity of one’s query. Typically, a WITH clause will contain a subquery whose results will be used elsewhere in the query (although the expression bound with a WITH doesn’t explicitly have to a QueryBlock production). In gSQL++, users can also use a WITH clause to define a temporary graph. WITH GRAPH is the alternative method used to define graphs (compared to CREATE GRAPH), and should be used when a graph only needs to be known in the context of a single query.


WITH Clause

WITH Clause Diagram


WITH Term

WITH Term Diagram


The following example defines a temporary graph called NewGelpGraph that reuses the FRIENDS_WITH edge definition from GelpGraph to add a weight attribute that depends on the number of reviews each user has written.

WITH
    GRAPH NewGelpGraph AS
        VERTEX (:User)
            PRIMARY KEY (user_id)
            AS Gelp.Users,
        EDGE (:User)-[:FRIENDS_WITH]->(:User)
            SOURCE KEY      (user_id)
            DESTINATION KEY (friend)
            AS ( 
                FROM
                    GRAPH YelpGraph
                        (u1:User)-[:KNOWS]->(u2:User),
                        (u1)-[:MADE]->(r1:Review),
                        (u2)-[:MADE]->(r2:Review)
                GROUP BY
                    u1.user_id,
                    u2.user_id
                LET
                    u1_reviews = COUNT(DISTINCT r1),
                    u2_reviews = COUNT(DISTINCT r2)
                SELECT
                    u1.user_id              AS user_id,
                    u2.user_id              AS friend,
                    u1_reviews + u2_reviews AS weight
            )
FROM     
    GRAPH NewGelpGraph
        (u1:User)-[fw]->(u2:User)
WHERE
    fw.weight > 10
SELECT
    u1,
    u2;

As alluded to in the GROUP BY section here, this weight attribute can later be used to define custom attributes on-the-fly to build cheapest-path queries.

LET Clause

LET clauses serve the same purpose in gSQL++ as they do in SQL++: for specifying an expression once, but referring to the expression elsewhere one or more times elsewhere in your query. Refer to the AsterixDB documentation on LET clauses here for more details.


LET Clause

LET Clause Diagram


WHERE Clause

WHERE clauses serve the same purpose in gSQL++ as they do in SQL++: to filter out records that do not satisfy a certain condition, specified using variables from the FROM clause. Refer to the AsterixDB documentation on WHERE clauses here for more details.


WHERE Clause

WHERE Clause Diagram


GROUP BY Clause

GROUP BY clauses serve the same purpose in gSQL++ as they do in SQL++: to organize records into groupings defined by a grouping element. gSQL++ also inherits the same grouping semantics from SQL: after a GROUP BY, the only fields that can referred to are fields from the grouping fields, or aggregate functions on the group itself (GROUP AS offers more flexibility here, as we’ll see later). Refer to the AsterixDB documentation on grouping here for more details.


GROUP BY Clause

GROUP BY Diagram


The following query retrieves how many 1 to 5 hop paths there are for every pair of users.

FROM      
    GRAPH GelpGraph
        (u1:User)-[{1,5}]->(u2:User)
GROUP BY  
    u1, 
    u2
SELECT    
    u1       AS u1, 
    u2       AS u2, 
    COUNT(*) AS cnt;

In the example above, the FROM clause produces records for all paths (of 1 to 5 hops) between u1 and u2. The GROUP BY clause then generates groups for all unique pairs of u1 and u2, and then counts all records in each group, for each group.

GROUP AS clauses serve the same purpose in gSQL++ as they do in SQL++: to preserve all records in a group, as they were before the GROUP BY clause. Refer to the AsterixDB documentation on GROUP AS clauses here for more details. When GROUP AS is used in conjunction with navigational pattern matching, one can express a wide array of queries that other languages would either dedicate special syntax for, or expose as a special function. The following query asks for the shortest path between two users u1 and u2.

FROM     
    GRAPH GelpGraph
        (u1:User)-[p+]->(u2:User)
GROUP BY  
    u1, 
    u2
    GROUP AS g
LET       
    shortestPath = (
        FROM
            g
        SELECT VALUE 
            g.p
        ORDER BY 
            LEN(EDGES(g.p)) ASC
        LIMIT    
            1
    )[0]
SELECT    
    u1, 
    u2, 
    shortestPath;

In the example above, we specify a subquery that operates on an individual group, as opposed to all paths in the FROM clause. In doing so, we can easily retrieve the shortest path by sorting our group by path length (LEN(EDGES(g.p))) and limiting our result set to 1.

Suppose now we want to modify our previous example to find the two shortest paths containing users with the name “Mary”. The following query modifies only the shortestPath subquery from the previous example and nothing else:

FROM     
    GRAPH GelpGraph
        (u1:User)-[p+]->(u2:User)
GROUP BY  
    u1, 
    u2
    GROUP AS g
LET       
    shortestPath = (
        FROM
            g
        WHERE
            SOME v IN VERTICES(g.p) SATISFIES v.name = "Mary"
        SELECT VALUE 
            g.p
        ORDER BY 
            LEN(EDGES(g.p)) ASC
        LIMIT    
            2
    )[0]
SELECT    
    u1, 
    u2, 
    shortestPath;

Graphix users don’t need to learn a new type of language / syntax to reason about path problems here, they can simply reuse their existing SQL / SQL++ knowledge.

HAVING Clause

HAVING clauses serve the same purpose in gSQL++ as they do in SQL++: to filter out groups that do not satisfy a certain condition, specified using aggregates on the group itself. Refer to the AsterixDB documentation on HAVING clauses here for more details.


HAVING Clause

HAVING Clause Diagram