This page describes the mechanisms that can be used to extend and modify query execution within ARQ. Through these mechanisms, ARQ can be used to query different graph implementations and to provide different query evaluation and optimization strategies for particular circumstances. These mechanisms are used by TDB.
ARQ can be extended in various ways to incorporate custom code into a query. Custom filter functions and property functions provide ways to add application specific code. The free text search capabilities, using Apache Lucene, are provided via a property function. Custom filter functions and property functions should be used where possible.
Jena itself can be extended by providing a new implementation of
the Graph
interface. This can be used to encapsulate specific
specialised storage and also for wrapping non-RDF sources to look
like RDF. There is a common implementation framework provided by
GraphBase
so only one operation, the find
method, needs to be
written for a read-only data source. Basic find works well in many
cases, and the whole Jena API will be able to use the extension.
For higher SPARQL performance, ARQ can be extended at the
basic graph matching or
algebra level.
Applications writers who extend ARQ at the query execution level should be prepared to work with the source code for ARQ for specific details and for finding code to reuse. Some examples can be found arq/examples directory
- Overview of ARQ Query processing
- The Main Query Engine
- Graph matching and a custom StageGenerator
- OpExecutor
- Quads
- Mixed Graph Implementation Datasets
- Custom Query Engines
- Extend the algebra
Overview of ARQ Query Processing
The sequence of actions performed by ARQ to perform a query are parsing, algebra generation, execution building, high-level optimization, low-level optimization and finally evaluation. It is not usual to modify the parsing step nor the conversion from the parse tree to the algebra form, which is a fixed algorithm defined by the SPARQL standard. Extensions can modify the algebra form by transforming it from one algebra expression to another, including introducing new operators. See also the documentation on working with the SPARQL algebra in ARQ including building algebra expressions programmatically, rather than obtaining them from a query string.
Parsing
The parsing step turns a query string into a Query
object. The
class Query
represents the abstract syntax tree (AST) for the
query and provides methods to create the AST, primarily for use by
the parser. The query object also provides methods to serialize the
query to a string. Because this is from an AST, the string produced will be
very close to the original query with the same syntactic elements,
but without comments, and formatted with whitespace for
readability. It is not usually the best way to build a query
programmatically and the AST is not normally an extension point.
The query object can be used many times. It is not modified once created, and in particular it is not modified by query execution.
Algebra generation
ARQ generates the
SPARQL algebra
expression for the query. After this a number of transformations
can be applied (for example, identification of property functions)
but the first step is the application of the algorithm in the
SPARQL specification for translating a SPARQL query string, as held
in a Query
object into a SPARQL algebra expression. This includes
the process of removing joins involving the identity pattern (the
empty graph pattern).
For example, the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox ?nick
WHERE { ?x foaf:name ?name ;
foaf:mbox ?mbox .
OPTIONAL { ?x foaf:nick ?nick }
}
becomes
(prefix ((foaf: <http://xmlns.com/foaf/0.1/>))
(project (?name ?mbox ?nick)
(leftjoin
(bgp
(triple ?x foaf:name ?name)
(triple ?x foaf:mbox ?mbox)
)
(bgp (triple ?x foaf:nick ?nick)
)
)))
using the SSE syntax to write out the internal data-structure for the algebra.
The online SPARQL validator at sparql.org can be used to see the algebra expression for a SPARQL query. This validator is also included in Fuseki.
High-Level Optimization and Transformations
There is a collection of transformations that can be applied to the algebra, such as replacing equality filters with a more efficient graph pattern and an assignment. When extending ARQ, a query processor for a custom storage layout can choose which optimizations are appropriate and can also provide its own algebra transformations.
A transform is code that converts an algebra operation into other
algebra operations. It is applied using the Transformer
class:
Op op = ... ;
Transform someTransform = ... ;
op = Transformer.transform(someTransform, op) ;
The Transformer
class applies the transform to each operation in
the algebra expression tree. Transform
itself is an interface,
with one method signature for each operation type, returning a
replacement for the operator instance it is called on.
One such transformation is to turn a SPARQL algebra expression
involving named graphs and triples into one using quads. This
transformation is performed by a call to Algebra.toQuadForm
.
Transformations proceed from the bottom of the expression tree to
the top. Algebra expressions are best treated as immutable so a
change made in one part of the tree should result in a copy of the
tree above it. This is automated by the TransformCopy
class
which is the commonly used base class for writing transforms. The
other helper base class is TransformBase,
which provides the
identify operation (returns the node supplied) for each transform
operation.
Operations can be printed out in
SSE syntax. The Java toString
method is overridden to provide pretty printing and the static
methods in WriterOp
provide output to various output objects like
java.io.OutputStream
.
Low-Level Optimization and Evaluation
The step of evaluating a query is the process of executing the algebra expression, as modified by any transformations applied, to yield a stream of pattern solutions. Low-level optimizations include choosing the order in which to evaluate basic graph patterns. These are the responsibility of the custom storage layer. Low-level optimization can be carried out dynamically as part of evaluation.
Internally, ARQ uses iterators extensively. Where possible,
evaluation of an operation is achieved by feeding the stream of
results from the previous stage into the evaluation. A common
pattern is to take each intermediate result one at a time (use
QueryIterRepeatApply
to be called for each binding) ,
substituting the variables of pattern with those in the incoming
binding, and evaluating to a query iterator of all results for this
incoming row. The result can be the empty iterator (one that always
returns false for hasNext
). It is also common to not have to
touch the incoming stream at all but merely to pass it to
sub-operations.
Query Engines and Query Engine Factories
The steps from algebra generation to query evaluation are carried
out when a query is executed via the QueryExecution.execSelect
or
other QueryExecution
exec operation. It is possible to carry out
storage-specific operations when the query execution is created. A
query engine works in conjunction with a QueryExecution
to provide the evaluation of a query
pattern. QueryExecutionBase
provides all the machinery for the
different result types and does not need to be modified by
extensions to query execution.
ARQ provides three query engine factories; the main query engine factory, one for a reference query engine and one to remotely execute a query. TDB provides its own query engine factories which they register during sub-system initialization. Both extend the main query engine described below.
The reference query engine is a direct top-down evaluation of the expression. Its purpose is to be simple so it can be easily verified and checked then its results used to check more complicated processing in the main engine and other implementations. All arguments to each operator are fully evaluated to produce intermediate in-memory tables then a simple implementation of the operator is called to calculate the results. It does not scale and does not perform any optimizations. It is intended to be clear and simple; it is not designed to be efficient.
Query engines are chosen by referring to the registry of query engine factories.
public interface QueryEngineFactory
{
public boolean accept(Query query, DatasetGraph dataset, Context context) ;
public Plan create(Query query, DatasetGraph dataset, Binding inputBinding, Context context) ;
public boolean accept(Op op, DatasetGraph dataset, Context context) ;
public Plan create(Op op, DatasetGraph dataset, Binding inputBinding, Context context) ;
}
When the query execution factory is given a dataset and query, the
query execution factory tries each registered engine factory in
turn calling the accept
method (for query or algebra depending on
how it was presented). The registry is kept in reverse registration
order - the most recently registered query engine factory is tried
first. The first query engine factory to return true is chosen and
no further engine factories are checked.
When a query engine factory is chosen, the create
method is
called to return a Plan
object for the execution. The main
operation of the Plan
interface is to get the QueryIterator
for
the query.
See the example arq.examples.engine.MyQueryEngine
at
jena-examples:arq/examples.
The Main Query Engine
The main query engine can execute any query. It contains a number
of basic graph pattern matching implementations including one that
uses the Graph.find
operation so it can work with any
implementation of the Jena Graph SPI. The main query engine works
with general purpose datasets but not directly with quad stores; it
evaluates patterns on each graph in turn. The main query engine
includes optimizations for the standard Jena implementation of
in-memory graphs.
High-level optimization is performed by a sequence of
transformations. This set of optimizations is evolving. A custom
implementation of a query engine can reuse some or all of these
transformations (see Algebra.optimize
which is the set of
transforms used by the main query engine).
The main query engine is a streaming engine. It evaluates
expressions as the client consumes each query solution. After
preparing the execution by creating the initial conditions (a
partial solution of one row and no bound variables or any initial
bindings of variables), the main query engine calls QC.execute
which is the algorithm to execute a query. Any extension that
wished to reuse some of the main query engine by providing its own
OpExecutor
must call this method to evaluate a sub-operation.
QC.execute
finds the currently active OpExecutor
factory,
creates an OpExecutor
object and invokes it to evaluate one
algebra operation.
There are two points of extension for the main query engine:
- Stage generators, for evaluating basic graph patterns and reusing the rest of the engine.
OpExecutor
to execute any algebra operator specially.
The standard OpExecutor
invokes the stage generator mechanism to
match a basic graph pattern.
Graph matching and a custom StageGenerator
The correct point to hook into ARQ for just extending basic graph
pattern matching (BGPs) is to provide a custom StageGenerator
.
(To hook into filtered basic graph patterns, the extension will
need to provide its own OpExecutor
factory). The advantage of
the StageGenerator
mechanism, as compared to the more general
OpExecutor
described below, is that it more self-contained and
requires less detail about the internal evaluation of the other
SPARQL algebra operators. This extension point corresponds to
section 12.6
“Extending SPARQL Basic Graph Matching”.
Below is the default code to match a BGP from
OpExecutor.execute(OpBGP, QueryIterator)
. It merely calls fixed
code in the StageBuilder
class.The input is a stream of results
from earlier stages. The execution must return a query iterator
that is all the possible ways to match the basic graph pattern for
each of the inputs in turn. Order of results does not matter.
protected QueryIterator execute(OpBGP opBGP, QueryIterator input)
{
BasicPattern pattern = opBGP.getPattern() ;
return StageBuilder.execute(pattern, input, execCxt) ;
}
The StageBuilder
looks for the stage generator by accessing the
context for the execution:
StageGenerator stageGenerator = (StageGenerator)context.get(ARQ.stageGenerator) ;
where the context is the global context and any query execution specific additions together with various execution control elements.
A StageGenerator
is an implementation of:
public interface StageGenerator
{
public QueryIterator execute(BasicPattern pattern,
QueryIterator input,
ExecutionContext execCxt) ;
}
Setting the Stage Generator
An extension stage generator can be registered on a per-query execution basis or (more usually) in the global context.
StageBuilder.setGenerator(Context, StageGenerator)
The global context can be obtained by a call to ARQ.getContext()
StageBuilder.setGenerator(ARQ.getContext(), myStageGenerator) ;
In order to allow an extensions to still permit other graphs to be used, stage generators are usually chained, with each new custom one passing the execution request up the chain if the request is not supported by this custom stage generator.
public class MyStageGenerator implements StageGenerator
{
StageGenerator above = null ;
public MyStageGenerator (StageGenerator original)
{ above = original ; }
@Override
public QueryIterator execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt)
{
Graph g = execCxt.getActiveGraph() ;
// Test to see if this is a graph we support.
if ( ! ( g instanceof MySpecialGraphClass ) )
// Not us - bounce up the StageGenerator chain
return above.execute(pattern, input, execCxt) ;
MySpecialGraphClass graph = (MySpecialGraphClass )g ;
// Create a QueryIterator for this request
...
This is registered by setting the global context (StageBuilder
has a convenience operation to do this):
// Get the standard one.
StageGenerator orig = (StageGenerator)ARQ.getContext().get(ARQ.stageGenerator) ;
// Create a new one
StageGenerator myStageGenerator= new MyStageGenerator(orig) ;
// Register it
StageBuilder.setGenerator(ARQ.getContext(), myStageGenerator) ;
Example: jena-examples:arq/examples/bgpmatching
OpExecutor
A StageGenerator
provides matching for a basic graph pattern. If
an extension wishes to take responsibility for more of the
evaluation then it needs to work with OpExecutor
. This includes
evaluation of filtered basic graph patterns.
An example query using a filter:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX books: <http://example.org/book/>
SELECT *
WHERE
{ ?book dc:title ?title .
FILTER regex(?title, "Paddington")
}
results in the algebra expression for the pattern:
(filter (regex ?title "Paddington")
(bgp (triple ?book dc:title ?title)
))
showing that the filter is being applied to the results of a basic graph pattern matching.
Note: this is not the way to provide custom filter operations. See the documentation for application-provided filter functions.
Each step of evaluation in the main query engine is performed by a
OpExecutor
and a new one is created from a factory at each step.
The factory is registered in the execution context. The
implementation of a specialized OpExecutor
can inherit from the
standard one and override only those algebra operators it wishes to
deal with, including inspecting the execution and choosing to
pass up to the super-class based on the details of the
operation. From the query above, only regex filters might be
specially handled.
Registering an OpExecutorFactory
:
OpExecutorFactory customExecutorFactory = new MyOpExecutorFactory(...) ;
QC.setFactory(ARQ.getCOntext(), customExecutorFactory) ;
QC
is a point of indirection that chooses the execution process at
each stage in a query so if the custom execution wishes to evaluate
an algebra operation within another operation, it should call
QC.execute
. Be careful not to loop endlessly if the operation is
itself handled by the custom evaluator. This can be done by
swapping in a different OpExecutorFactory
.
// Execute an operation with a different OpExecution Factory
// New context.
ExecutionContext ec2 = new ExecutionContext(execCxt) ;
ec2.setExecutor(plainFactory) ;
QueryIterator qIter = QC.execute(op, input, ec2) ;
private static OpExecutorFactory plainFactory =
new OpExecutorFactory()
{
@Override
public OpExecutor create(ExecutionContext execCxt)
{
// The default OpExecutor of ARQ.
return new OpExecutor(execCxt) ;
}
} ;
Quads
If a custom extension provides named graphs, then it may be useful
to execute the quad form of the query. This is done by writing a
custom query engine and overriding QueryEngineMain.modifyOp
:
@Override
protected Op modifyOp(Op op)
{
op = Substitute.substitute(op, initialInput) ;
// Use standard optimizations.
op = super.modifyOp(op) ;
// Turn into quad form.
op = Algebra.toQuadForm(op) ;
return op ;
}
The extension may need to provide its own dataset implementation so that it can detect when queries are directed to its named graph storage. TDB are examples of this.
Mixed Graph Implementation Datasets
The dataset implementation used in normal operation does not work
on quads but instead can provide a dataset with a collection of
graphs each from different implementation sub-systems. In-memory
graphs can be mixed with database backed graphs as well as custom
storage systems. Query execution proceeds per-graph so that an
custom OpExecutor
will need to test the graph to work with to
make sure it is of the right class. The pattern in the
StageGenerator
extension point is an example of a design pattern in
that situation.
Custom Query Engines
A custom query engine enables an extension to choose which datasets
it wishes to handle. It also allows the extension to intercept
query execution during the setup of the execution so it can modify
the algebra expression, introduce its own algebra extensions,
choose which high-level optimizations to apply and also transform
to the expression into quad form. Execution can proceed with the
normal algorithm or a custom OpExecutor
or a custom Stage
Generator or a combination of all three extension mechanism.
Only a small, skeleton custom query engine is needed to intercept
the initial setup. See the example in
jena-examples:arq/examples
arq.examples.engine.MyQueryEngine
.
While it is possible to replace the entire process of query
evaluation, this is a substantial endeavour. QueryExecutionBase
provides the machinery for result presentation (SELECT
,
CONSTRUCT
, DESCRIBE
, ASK
), leaving the work of pattern
evaluation to the custom query engine.
Algebra Extensions
New operators can be added to the algebra using the OpExt
class
as the super-class of the new operator. They can be inserted into
the expression to be evaluated using a custom query engine to
intercept evaluation initialization. When evaluation of a query
requires the evaluation of a sub-class of OpExt
, the eval
method is called.