Queries
Table of Contents
- Queries (Overview)
FROM
ClauseMATCH
ExpressionWITH
ClauseLET
ClauseWHERE
ClauseGROUP BY
ClauseHAVING
Clause
Queries (Overview)
A query can either be an expression (whose composition remains unchanged from SQL++), or a construction of query blocks. A query block may contain several clauses, including SELECT
, FROM
, LET
, WHERE
, GROUP BY
, and HAVING
. The following productions are also unchanged from SQL++.
Query
Selection
Query Block
Stream Generator
Similar to SQL++ (but unlike SQL), gSQL++ allows the SELECT
clause to appear either at the beginning or the end of a query black. For some queries, placing the SELECT
clause at the end may make a query block easier to understand because the SELECT
clause refers to variables defined in the stream generator production.
FROM
Clause
The purpose of a FROM
clause is to iterate over a collection, binding a variable to each item in turn. Graphix (and gSQL++) do not change the purpose of the FROM
clause, rather the gSQL++ extension gives users an additional method to specify variable bindings (in the form of vertices, edges, and paths). As shown in the FromTerm
production, gSQL++ users are free to interleave their graph bindings with existing SQL++ document bindings to express truly synergistic document-based and graph-based queries.
FROM Clause
From Term
Qualified Name
Named Expression (NamedExpr)
JOIN Step
UNNEST Step
MATCH Step
Below is a query that illustrates the usage of the gSQL++ extension to iterate over mappings of a single-edge pattern to our managed graph GelpGraph
.
FROM
GRAPH GelpGraph
(u1:User)-[:FRIENDS_WITH]->(u2:User)
SELECT
u1,
u2;
Putting aside the good practice of specifying explicit iteration variables in SQL++, the problem of specifying graph query patterns in nearly all non-trivial use cases involves describing more than one graph element. Consequently, gSQL++ does not support implicit iteration variables for variables defined in MatchExpr
and MatchStep
productions. This lack of support contrasts SQL++, where one-dataset queries are (arguably) more common.
MATCH
Expression
The purpose of a PatternExpr
is to specify a (potentially navigational) graph pattern and introduce all mapping permutations of the graph pattern to the underlying data.
MATCH Expression (MatchExpr)
Pattern Expression (PatternExpr)
Vertex Pattern
Vertex Detail
Edge Pattern
Edge Detail
Path Pattern
Path Detail
Label Set
Repetition Quantifier
The following query illustrates a basic graph pattern (BGP) matching query, which finds all users that have written a review.
FROM
GRAPH GelpGraph
(r:Review)<-[mb:MADE_BY]-(u:User)
SELECT DISTINCT
u;
In the example above, two vertices and one edge are specified in the graph pattern. An edge must always connect exactly two vertices, but a vertex can be specified without an edge.
Graph patterns are implicitly joined with one another if they share a vertex variable. Such style may be useful for improving the readability of your query. The example below shows two graph patterns: the first of which finds all users that have written a review, and the second of which finds all users and their friends.
FROM
GRAPH GelpGraph
(r:Review)<-[mb:MADE_BY]-(u1:User),
(u1)-[:FRIENDS_WITH]->(u2:User)
SELECT DISTINCT
u1;
Conceptually, all permutations of both patterns are then joined on their common vertices. If a pattern does not share any vertices with any other pattern in that specific FromTerm
, then we say that our pattern is disjoint. Disjoint patterns are analogous to cartesian products in SQL: the binding tuple stream after the FromTerm
contains all possible pairs of the disjoint patterns.
Vertices and edges are the core of all graph queries, but often we may want to reason about our graph at the level of paths. Paths are a collection of edges, and can be specified in gSQL++ in a similar manner to an edge pattern. The difference with paths, however, is the use of a repetition quantifier after the variable and label specification. Paths are described in Graphix using a regular expression of edge labels. In the query below, we assign a variable of p
to a path consisting of only 1 to 3 FRIENDS_WITH
edges.
FROM
GRAPH GelpGraph
(u1:User)<-[p:FRIENDS_WITH{1,3}]-(u2:User)
SELECT VALUE
PATH_VERTICES(p);
Similar to SQL, gSQL++ offers optional binding to patterns with the LEFT MATCH
clause. The following example asks for users and their friends, as well as reviews if they have any.
FROM
GRAPH GelpGraph
(u1:User)-[:FRIENDS_WITH]->(u2:User)
LEFT MATCH
(u1)<-[mb:MADE_BY]-(r:Review)
SELECT
u1,
u2,
mb,
r;
gSQL++ will only bind to variables declared in a LEFT MATCH
clause if (and only if) the corresponding pattern can be matched in full. In the example above, if there was a mb
edge record that was connected to u1
but not to any existing user, then MISSING
would be bound to both mb
and r
for that instance. The flexibility of AsterixDB’s data model means that edges in Graphix may not be “consistent”. gSQL++ will always work in units of patterns, not individual collections.
Features such as edge label alternation and negation are planned, but not implemented yet.
WITH
Clause
Similar to SQL and SQL++, WITH
clauses are used to improve the modularity of one’s query. Typically, a WITH
clause will contain a subquery whose results will be used elsewhere in the query (although the expression bound with a WITH
doesn’t explicitly have to a QueryBlock production). In gSQL++, users can also use a WITH
clause to define a temporary graph. WITH GRAPH
is the alternative method used to define graphs (compared to CREATE GRAPH
), and should be used when a graph only needs to be known in the context of a single query.
WITH Clause
WITH Term
The following example defines a temporary graph called NewGelpGraph
that reuses the FRIENDS_WITH
edge definition from GelpGraph
to add a weight attribute that depends on the number of reviews each user has written.
WITH
GRAPH NewGelpGraph AS
VERTEX (:User)
PRIMARY KEY (user_id)
AS Gelp.Users,
EDGE (:User)-[:FRIENDS_WITH]->(:User)
SOURCE KEY (user_id)
DESTINATION KEY (friend)
AS (
FROM
GRAPH YelpGraph
(u1:User)-[:KNOWS]->(u2:User),
(u1)-[:MADE]->(r1:Review),
(u2)-[:MADE]->(r2:Review)
GROUP BY
u1.user_id,
u2.user_id
LET
u1_reviews = COUNT(DISTINCT r1),
u2_reviews = COUNT(DISTINCT r2)
SELECT
u1.user_id AS user_id,
u2.user_id AS friend,
u1_reviews + u2_reviews AS weight
)
FROM
GRAPH NewGelpGraph
(u1:User)-[fw]->(u2:User)
WHERE
fw.weight > 10
SELECT
u1,
u2;
As alluded to in the GROUP BY
section here, this weight attribute can later be used to define custom attributes on-the-fly to build cheapest-path queries.
LET
Clause
LET
clauses serve the same purpose in gSQL++ as they do in SQL++: for specifying an expression once, but referring to the expression elsewhere one or more times elsewhere in your query. Refer to the AsterixDB documentation on LET
clauses here for more details.
LET Clause
WHERE
Clause
WHERE
clauses serve the same purpose in gSQL++ as they do in SQL++: to filter out records that do not satisfy a certain condition, specified using variables from the FROM
clause. Refer to the AsterixDB documentation on WHERE
clauses here for more details.
WHERE Clause
GROUP BY
Clause
GROUP BY
clauses serve the same purpose in gSQL++ as they do in SQL++: to organize records into groupings defined by a grouping element. gSQL++ also inherits the same grouping semantics from SQL: after a GROUP BY
, the only fields that can referred to are fields from the grouping fields, or aggregate functions on the group itself (GROUP AS
offers more flexibility here, as we’ll see later). Refer to the AsterixDB documentation on grouping here for more details.
GROUP BY Clause
The following query retrieves how many 1 to 5 hop paths there are for every pair of users.
FROM
GRAPH GelpGraph
(u1:User)-[{1,5}]->(u2:User)
GROUP BY
u1,
u2
SELECT
u1 AS u1,
u2 AS u2,
COUNT(*) AS cnt;
In the example above, the FROM
clause produces records for all paths (of 1 to 5 hops) between u1
and u2
. The GROUP BY
clause then generates groups for all unique pairs of u1
and u2
, and then counts all records in each group, for each group.
GROUP AS
clauses serve the same purpose in gSQL++ as they do in SQL++: to preserve all records in a group, as they were before the GROUP BY
clause. Refer to the AsterixDB documentation on GROUP AS
clauses here for more details. When GROUP AS
is used in conjunction with navigational pattern matching, one can express a wide array of queries that other languages would either dedicate special syntax for, or expose as a special function. The following query asks for the shortest path between two users u1
and u2
.
FROM
GRAPH GelpGraph
(u1:User)-[p+]->(u2:User)
GROUP BY
u1,
u2
GROUP AS g
LET
shortestPath = (
FROM
g
SELECT VALUE
g.p
ORDER BY
LEN(EDGES(g.p)) ASC
LIMIT
1
)[0]
SELECT
u1,
u2,
shortestPath;
In the example above, we specify a subquery that operates on an individual group, as opposed to all paths in the FROM
clause. In doing so, we can easily retrieve the shortest path by sorting our group by path length (LEN(EDGES(g.p))
) and limiting our result set to 1.
Suppose now we want to modify our previous example to find the two shortest paths containing users with the name “Mary”. The following query modifies only the shortestPath
subquery from the previous example and nothing else:
FROM
GRAPH GelpGraph
(u1:User)-[p+]->(u2:User)
GROUP BY
u1,
u2
GROUP AS g
LET
shortestPath = (
FROM
g
WHERE
SOME v IN VERTICES(g.p) SATISFIES v.name = "Mary"
SELECT VALUE
g.p
ORDER BY
LEN(EDGES(g.p)) ASC
LIMIT
2
)[0]
SELECT
u1,
u2,
shortestPath;
Graphix users don’t need to learn a new type of language / syntax to reason about path problems here, they can simply reuse their existing SQL / SQL++ knowledge.
HAVING
Clause
HAVING
clauses serve the same purpose in gSQL++ as they do in SQL++: to filter out groups that do not satisfy a certain condition, specified using aggregates on the group itself. Refer to the AsterixDB documentation on HAVING
clauses here for more details.
HAVING Clause