Getting Started
In this tutorial, we are going to start a 1-node Graphix cluster, establish a collection of AsterixDB datasets, build a graph over these datasets, and query the graph we just built.
Table of Contents
- Starting a Sample Cluster
- Building AsterixDB Datasets
- Defining a Graphix Graph
- Querying our Graphix Graph
- Stopping our Sample Cluster
Starting a Sample Cluster
- Head on over to the Installation section and install AsterixDB + Graphix.
- We are going to follow the instructions using a pre-built package. Execute the
quickstart.shscript to start a 1-node cluster using the Graphix extension../quickstart.sh
Building AsterixDB Datasets
- For our tutorial, we use the “Gelp” example: Users and their friends make Reviews about Businesses. To start, let’s create a new dataverse and all the aforementioned entities as datasets.
CREATE DATAVERSE Gelp; USE Gelp; CREATE TYPE BusinessesType AS { business_id : string }; CREATE DATASET Businesses (BusinessesType) PRIMARY KEY business_id; CREATE TYPE UsersType AS { user_id : bigint }; CREATE DATASET Users (UsersType) PRIMARY KEY user_id; CREATE TYPE ReviewsType AS { review_id : string }; CREATE DATASET Reviews (ReviewsType) PRIMARY KEY review_id;In the example above, all three datasets only have their primary keys defined. All other fields associated with each entity exist as open fields.
- Let’s now insert some data into our dataverse. We’ll start with our
Businessesdataset.INSERT INTO Gelp.Businesses [ { "business_id": "B1", "name": "Papa's Supermarket", "number": "909-123-6123" }, { "business_id": "B2", "name": "Mother's Gas Station", "number": "111-724-1123" }, { "business_id": "B3", "name": "Uncle's Bakery" } ];The three records inserted show two fields that were not defined in the
BusinessesTypedata type:nameandnumber. The last record illustrates the potential heterogeneity enabled by AsterixDB’s document data model, where some businesses may not have a number attached to them. - Having populated our
Businessesdataset, let’s now move onto ourUsers:INSERT INTO Gelp.Users [ { "user_id": 1, "name": "Mary", "friends": [ 2 ] }, { "user_id": 2, "name": "John", "friends": [ 1, 3, 4 ] }, { "user_id": 3, "name": "Kevin", "friends": [ 2, 5 ] }, { "user_id": 4, "name": "Susan", "friends": [ 2, 5 ] }, { "user_id": 5, "name": "Larry", "friends": [ 3, 4 ] } ];Similar to our
Businessesrecords, theseUsersrecords inserted include two fields that weren’t defined in their dataset type:nameandfriends. A user may have a name and an array ofuser_idvalues denoting their friends. The potentialfriendsarray of a user depicts a common denormalized form of modeling one-to-many relationships, again enabled by AsterixDB’s document data model. - Finally, let’s move onto our last dataset:
Reviews.INSERT INTO Gelp.Reviews [ { "review_id": "R1", "user_id": 1, "business_id": "B3", "review_time": date("2022-03-01") }, { "review_id": "R2", "user_id": 3, "business_id": "B3", "review_time": date("2022-03-01") }, { "review_id": "R3", "user_id": 2, "business_id": "B3", "review_time": date("2022-03-02") }, { "review_id": "R4", "user_id": 5, "business_id": "B1", "review_time": date("2022-03-03") }, { "review_id": "R5", "user_id": 5, "business_id": "B2" } ];A review may include an associated user, business, and review time.
Defining a Graphix Graph
-
At this point, we have not gone over anything new (in the context of AsterixDB). We now have a logical data model for Gelp with three defined datasets:
Users,Reviews, andBusinesses. To iterate, these three datasets are used to model the following:Users and their friends make Reviews about Businesses.
Graphically, we can represent this statement as follows:
We will now build a managed graph piece by piece. We start with a name for our graph:
GelpGraph.CREATE GRAPH GelpGraph AS ... ; -
Now let us define our vertices. As depicted in the diagram above, we have three types of vertices: User, Review, and Business. In the context of the Property Graph Model, these vertex “types” will act as our vertex labels.
- Each vertex definition requires three pieces of information: the vertex label the vertex body and the vertex key. The vertices of label
Businessare defined using theGelp.Businessesdataset, where each record inGelp.Businessescorresponds to a vertex in our graph. The primary key of aBusinessvertex is the same as the logical primary key of its vertex body:business_id. With these three pieces of information, we define the schema of aBusinessvertex in theGelpGraphas such:VERTEX (:Business) PRIMARY KEY (business_id) AS Gelp.Businesses - The vertices of label
Userare similarly defined using theGelp.Usersdataset, where each record inGelp.Userscorresponds to a vertex in our graph. The primary key of aUservertex is again the same as the logical primary key of its vertex body:user_id. We define the schema of aUservertex in theGelpGraphas such:VERTEX (:User) PRIMARY KEY (user_id) AS Gelp.Users - We now move onto the last type of vertex:
Review. A vertex of labelReviewis defined using theGelp.Reviewsdataset, with the same primary key as its body:review_id. Now suppose that we want to defineReviewvertices usingGelp.Reviewsrecords that have a value forreview_time. The body of a vertex is similar to that of an AsterixDB view body: we could either use an existing dataset as the vertex body, or a more general query. We will use the latter here for ourReviewvertex:VERTEX (:Review) PRIMARY KEY (review_id) AS ( FROM Gelp.Reviews R WHERE R.review_time IS NOT UNKNOWN SELECT VALUE R )
- Each vertex definition requires three pieces of information: the vertex label the vertex body and the vertex key. The vertices of label
-
With our vertices defined, we now will define our edges. Referencing our diagram above, we have three types of relationships between our vertices: (1) Reviews are ABOUT Businesses. (2) Reviews are MADE_BY Users. (3) Users are FRIENDS_WITH other Users. These relationship “types” will act as our edge labels.
- Each edge definition requires six pieces of information now: the source vertex label and key, the destination vertex label and key, the edge label, & the edge body. The edges of label
ABOUThave source vertices of the labelReviewand destination vertices of the labelBusiness. The edge bodies of labelABOUTare defined using query that references theGelp.Reviewsdataset. The goal of this edge body is to specify i) a field that will be used to connect (orJOIN) the edge body to the source label (i.e. our source key), ii) a field that will be used to connect the edge body to the destination label (i.e. our destination key), and iii) any edge properties. With these six pieces of information, we define the schema of anABOUTedge in theGelpGraphas such:EDGE (:Review)-[:ABOUT]->(:Business) SOURCE KEY (review_id) DESTINATION KEY (business_id) AS ( FROM Gelp.Reviews R WHERE R.review_time IS NOT UNKNOWN SELECT R.review_id, R.business_id )The fact that our edge body shares a
FROMclause with the vertex body forReviewillustrates a trait of our underlying datasets: theReviewsdataset has an embedded 1:N relationship with ourBusinessesdataset. For those familiar with translating Entity-Relationship diagrams into SQL tables, the purpose of an edge body is to specify a relationship table that holds foreign key references to two other tables (in our case, vertices). - The edges of label
MADE_BYare similarly defined to edges of labelABOUT.MADE_BYedges have source vertices of the labelReviewand destination vertices of the labelUser. Of our edge body, the key used to connect our edge toReviewvertices is(review_id). The key used to connect our edge toUservertices is(user_id). We define the schema of aMADE_BYedge in theGelpGraphas such:EDGE (:Review)-[:MADE_BY]->(:User) SOURCE KEY (review_id) DESTINATION KEY (user_id) AS ( FROM Gelp.Reviews R WHERE R.review_time IS NOT UNKNOWN SELECT R.review_id AS review_id, R.user_id AS user_id ) - The edges of label
FRIENDS_WITHhave source vertices of labelUserand destination vertices of the labelUser. The edge bodies are defined using anUNNESTquery of theGelp.Usersdataset. Of our edge body, the key used to connect our edge to our sourceUservertex is(user_id). The key used to connect our edge to our destinationUservertex is(friend). We define the schema of aFRIENDS_WITHedge in theGelpGraphas such:EDGE (:User)-[:FRIENDS_WITH]->(:User) SOURCE KEY (user_id) DESTINATION KEY (friend) AS ( FROM Gelp.Users U, U.friends F SELECT F AS friend, U.user_id AS user_id )
- Each edge definition requires six pieces of information now: the source vertex label and key, the destination vertex label and key, the edge label, & the edge body. The edges of label
-
When we put all these pieces together, we get the following:
CREATE GRAPH Gelp.GelpGraph AS VERTEX (:Business) PRIMARY KEY (business_id) AS Gelp.Businesses, VERTEX (:User) PRIMARY KEY (user_id) AS Gelp.Users, VERTEX (:Review) PRIMARY KEY (review_id) AS ( FROM Gelp.Reviews R WHERE R.review_time IS NOT UNKNOWN SELECT VALUE R ), EDGE (:Review)-[:ABOUT]->(:Business) SOURCE KEY (review_id) DESTINATION KEY (business_id) AS ( FROM Gelp.Reviews R WHERE R.review_time IS NOT UNKNOWN SELECT R.review_id, R.business_id ), EDGE (:Review)-[:MADE_BY]->(:User) SOURCE KEY (review_id) DESTINATION KEY (user_id) AS ( FROM Gelp.Reviews R WHERE R.review_time IS NOT UNKNOWN SELECT R.review_id, R.user_id ), EDGE (:User)-[:FRIENDS_WITH]->(:User) SOURCE KEY (user_id) DESTINATION KEY (friend) AS ( FROM Gelp.Users U, U.friends F SELECT F AS friend, U.user_id AS user_id );Issuing the statement above will create a managed graph in Graphix.
Querying our Graphix Graph
- Let’s now query our data. To start, let’s see what our
Businessvertices look like. We build the following gSQL++ query:FROM GRAPH Gelp.GelpGraph (b:Business) SELECT b;The query above starts by specifying the graph we are querying (the
FROM GRAPH Gelp.GelpGraphline), followed by a graph pattern consisting of a single vertex whose label is toBusiness, concluding with aSELECTclause containing the variable of our vertex. For a more in-depth explanation on what a graph pattern is, see the Graphix Query Model page. If we issue our query, we get the following results:{ "business_id": "B1", "name": "Papa's Supermarket", "number": "909-123-6123" } { "business_id": "B2", "name": "Mother's Gas Station", "number": "111-724-1123" } { "business_id": "B3", "name": "Uncle's Bakery" }Our results are all records from the
Businessesdataset, which makes sense given how we defined aBusinessvertex in our graph. - Let’s now see what an edge looks like. In particular, let’s see what all
ABOUTedges return. We build the following gSQL++ query:FROM GRAPH Gelp.GelpGraph (:Review)-[a:ABOUT]->(:Business) SELECT a;Issuing the query above yields the following results:
{ "a": { "review_id": "R1", "business_id": "B3" } } { "a": { "review_id": "R2", "business_id": "B3" } } { "a": { "review_id": "R3", "business_id": "B3" } } { "a": { "review_id": "R4", "business_id": "B1" } }In contrast to our
Businessvertices, ourReviewvertices and all connecting edges filter out records from theReviewsdataset if theirreview_timefield isNULLorMISSING. The query being executed by AsterixDB is analogous to the following SQL++ query:FROM Gelp.Reviews R, Gelp.Businesses B LET a = { "review_id": R.review_id, "business_id": R.business_id } WHERE R.business_id = B.business_id SELECT a; - Suppose that we now want to find whether two users are connected by some number of friends in our graph. In this scenario, we need to describe a path between two vertices instead of an edge. We build the following gSQL++ query:
SET `graphix.compiler.permit.unbounded-all-paths` "true"; FROM GRAPH Gelp.GelpGraph (u1:User)-[f:FRIENDS_WITH+]->(u2:User) LET pathIDs = ( FROM VERTICES(f) fv SELECT VALUE fv.user_id ) SELECT u1.user_id AS u1_user_id, u2.user_id AS u2_user_id, pathIDs AS pathIDs ORDER BY u1_user_id, u2_user_id;Issuing the query above yields the following results (shortened for brevity):
{ "u1_user_id": 1, "u2_user_id": 2, "pathIDs": [ 1, 2 ] } { "u1_user_id": 1, "u2_user_id": 3, "pathIDs": [ 1, 2, 3 ] } { "u1_user_id": 1, "u2_user_id": 3, "pathIDs": [ 1, 2, 4, 5, 3 ] } { "u1_user_id": 1, "u2_user_id": 4, "pathIDs": [ 1, 2, 4 ] } { "u1_user_id": 1, "u2_user_id": 4, "pathIDs": [ 1, 2, 3, 5, 4 ] } { "u1_user_id": 1, "u2_user_id": 5, "pathIDs": [ 1, 2, 3, 5 ] } { "u1_user_id": 1, "u2_user_id": 5, "pathIDs": [ 1, 2, 4, 5 ] } { "u1_user_id": 2, "u2_user_id": 1, "pathIDs": [ 2, 1 ] } { "u1_user_id": 2, "u2_user_id": 3, "pathIDs": [ 2, 3 ] } { "u1_user_id": 2, "u2_user_id": 3, "pathIDs": [ 2, 4, 5, 3 ] } { "u1_user_id": 2, "u2_user_id": 4, "pathIDs": [ 2, 4 ] } { "u1_user_id": 2, "u2_user_id": 4, "pathIDs": [ 2, 3, 5, 4 ] } { "u1_user_id": 2, "u2_user_id": 5, "pathIDs": [ 2, 3, 5 ] } { "u1_user_id": 2, "u2_user_id": 5, "pathIDs": [ 2, 4, 5 ] } { "u1_user_id": 3, "u2_user_id": 1, "pathIDs": [ 3, 2, 1 ] } { "u1_user_id": 3, "u2_user_id": 1, "pathIDs": [ 3, 5, 4, 2, 1 ] } { "u1_user_id": 3, "u2_user_id": 2, "pathIDs": [ 3, 2 ] } { "u1_user_id": 3, "u2_user_id": 2, "pathIDs": [ 3, 5, 4, 2 ] } { "u1_user_id": 3, "u2_user_id": 4, "pathIDs": [ 3, 2, 4 ] } { "u1_user_id": 3, "u2_user_id": 4, "pathIDs": [ 3, 5, 4 ] } { "u1_user_id": 3, "u2_user_id": 5, "pathIDs": [ 3, 5 ] }The query above illustrates a navigational graph pattern, where
fcorresponds to a path instead of an edge. Vertices of a path are accessed using theVERTICESfunction, and edges of a path are accessed using theEDGESfunction. Note that by default, such paths are disabled in Graphix (and must be explicitly enabled via thegraphix.compiler.permit.unbounded-all-pathsoption). - The previous query yields a large number of results (too many to display here). Let’s expand on the previous scenario: suppose we are not interested in all paths between the same two users, but instead we are interested in the shortest path. We build the following gSQL++ query, taking advantage of how SQL++ treats grouping:
FROM GRAPH Gelp.GelpGraph (u1:User)-[f:FRIENDS_WITH+]->(u2:User) GROUP BY u1, u2 GROUP AS g LET shortestPath = ( FROM g LET pathIDs = ( FROM VERTICES(g.f) fv SELECT VALUE fv.user_id ) SELECT VALUE pathIDs ORDER BY LEN(EDGES(g.f)) ASC LIMIT 1 )[0] SELECT u1.user_id AS u1_user_id, u2.user_id AS u2_user_id, shortestPath AS shortestPath, COUNT(*) AS totalPaths;Issuing the query above yields the following results:
{ "u1_user_id": 2, "u2_user_id": 1, "shortestPath": [ 2, 1 ], "totalPaths": 1 } { "u1_user_id": 2, "u2_user_id": 3, "shortestPath": [ 2, 3 ], "totalPaths": 2 } { "u1_user_id": 2, "u2_user_id": 4, "shortestPath": [ 2, 4 ], "totalPaths": 2 } { "u1_user_id": 2, "u2_user_id": 5, "shortestPath": [ 2, 3, 5 ], "totalPaths": 2 } { "u1_user_id": 1, "u2_user_id": 2, "shortestPath": [ 1, 2 ], "totalPaths": 1 } { "u1_user_id": 1, "u2_user_id": 3, "shortestPath": [ 1, 2, 3 ], "totalPaths": 2 } { "u1_user_id": 1, "u2_user_id": 4, "shortestPath": [ 1, 2, 4 ], "totalPaths": 2 } { "u1_user_id": 1, "u2_user_id": 5, "shortestPath": [ 1, 2, 3, 5 ], "totalPaths": 2 } { "u1_user_id": 3, "u2_user_id": 2, "shortestPath": [ 3, 2 ], "totalPaths": 2 } { "u1_user_id": 3, "u2_user_id": 1, "shortestPath": [ 3, 2, 1 ], "totalPaths": 2 } { "u1_user_id": 3, "u2_user_id": 4, "shortestPath": [ 3, 2, 4 ], "totalPaths": 2 } { "u1_user_id": 3, "u2_user_id": 5, "shortestPath": [ 3, 5 ], "totalPaths": 2 } { "u1_user_id": 4, "u2_user_id": 2, "shortestPath": [ 4, 2 ], "totalPaths": 2 } { "u1_user_id": 4, "u2_user_id": 1, "shortestPath": [ 4, 2, 1 ], "totalPaths": 2 } { "u1_user_id": 4, "u2_user_id": 3, "shortestPath": [ 4, 2, 3 ], "totalPaths": 2 } { "u1_user_id": 4, "u2_user_id": 5, "shortestPath": [ 4, 5 ], "totalPaths": 2 } { "u1_user_id": 5, "u2_user_id": 2, "shortestPath": [ 5, 3, 2 ], "totalPaths": 2 } { "u1_user_id": 5, "u2_user_id": 1, "shortestPath": [ 5, 3, 2, 1 ], "totalPaths": 2 } { "u1_user_id": 5, "u2_user_id": 3, "shortestPath": [ 5, 3 ], "totalPaths": 2 } { "u1_user_id": 5, "u2_user_id": 4, "shortestPath": [ 5, 4 ], "totalPaths": 2 }The query above can be thought of as grouping all distinct pairs of users
u1andu2, then fetching the path out of all paths betweenu1andu2that has the shortest length. The ability to operate on groups using sub-queries in SQL++ gives gSQL++ users the power to express a rich set of typical path finding problems (e.g. weighted shortest paths).
Stopping our Sample Cluster
- Navigate to the pre-built package directory from before.
-
Execute the
quickstop.shscript../quickstop.sh