Thursday, 6 April 2017

ElasticSearch Query: SQL Query

ElasticSearch(ES) is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
Just as in my last article, where I gave an equivalent of SQL query in Solr: http://starschemaanalysis.blogspot.in/2017/04/solr-query-sql-query.html, I will do the same for ES. ES provides a lot of flexibility and has an equivalent for complex SQL; we will try to explore some of those here.
It is assumed that you already have installed Elasticsearch and can run JSON Query using Kibana Dev tools or any other methods(Curl,etc..).
Let us start with a table "transactions" with columns-price (number), color (varchar), make(varchar), sold(date).
Insert the sample Elastic data into "transactions" folder. The command to search data is: GET /transactions/_search
1) Select All
SQL Query: Select * from transactions
ES Query: {  "query": {"match_all": {} } }
2) Select a filter. Color: red. There are simpler versions than this but to keep the queries consistent, I have used the following format. Must is alternate of "and" in SQL and will play a role when we have more than one filter.
SQL Query : Select * from transactions where color = 'red'
ES Query: { "query":{ "constant_score" : {"filter":{ "bool":{"must" :[{ "term": { "color": "red" } } ] }} } }}
3) Select using a filter, with more than one value. Color: red, green.
SQL Query : Select * from transactions where color in ('red','green') and
ES Query: { "query":{ "constant_score" : {"filter":{ "bool":{"must" :[{ "terms": { "color": ["red","green"] } },{ "terms": { "make": ["ford","honda"] } } ] }} } }}
4) Multiple filters Boolean AND: "And" in SQL and "Must" in ES
SQL Query: Select * from transactions where color in ('red','green') and make in ('ford','honda')
ES Query: { "query":{ "constant_score" : {"filter":{ "bool":{"must" :[{ "terms": { "color": ["red","green"] } },{ "terms": { "make": ["ford","honda"] } } ] }} } }}
5) Two filters Boolean OR. "OR" in SQL , "Should" in ES
SQL Query: Select * from transactions where (color in ('red','green') or make in ('ford','honda'))
ES Query: { "query":{ "constant_score" : {"filter":{ "bool":{"should" :[{ "terms": { "color": ["red","green"] } },{ "terms": { "make": ["ford","honda"] } } ] }} } }}
6) Date filter
SQL Query:Select * from transactions where (color in ('red','green') and make in ('ford','honda')) and sold between '2014-01-01' and '2014-12-31'
ES Query: { "query":{ "constant_score" : {"filter":{ "bool":{"must" :[{ "terms": { "color": ["red","green"] } },{ "terms": { "make": ["ford","honda"] } } ] ,"filter" :[{ "range" : { "sold" : { "gte" : "2014-01-01", "lte" : "2014-12-31" } } } ] }} } }}
7) Data Aggregation. size:0 has been used to get only aggreated data.
SQL Query: Select sum(price) single_sum_price":, avg(price) single_avg_price from transactions
ES Query: { "size" : 0,   "aggs" : {"single_avg_price": { "avg" : { "field" : "price" }}, "single_sum_price": {"sum" : { "field" : "price" }}}}
8) Group By
SQL Query: Select color, sum(price) single_sum_price , avg(price) single_avg_price from transactions group by color
ES Query:{ "size": 0,"aggs": {"group_by_color": {"terms": {"field": "color.keyword"},"aggs": {"average_balance": {"avg": {"field": "price"}},"sum_balance": {"sum": {"field": "price"}}}}}}
8) More than one Group By
SQL Query: Select make,color sum(price) single_sum_price , avg(price) single_avg_price from transactions group by make,color
ES Query:{ "size": 0, "aggs": { "group_by_make": {"terms": { "field": "make.keyword" }, "aggs": {"group_by_color": { "terms": {"field": "color.keyword"},"aggs": {"average_price": {"avg": {"field": "price"}}}}}}}}
9) A Complex Query: Here we can use formulas like a/b and sum(a)/sum(b). I am using one measure and a constant in place of the other - hence instead of a/b I shall be using a/2. Using different aggegations I am doing sum(A)/count(A). I am also using period label: MTD. We can take use such labels in muti-period query.
SQL Query:Select color,make sum(price) sprice , count(price) cprice, avg(price/2) avgvalue1, sum(price)/count(price) avgvalue2 from transactions group by color,make
ES Query : {"size": 0 , "aggs": { "filter1" : {"filter":{ "bool":{"must" :[{ "terms": { "color": ["red","green"] } },{ "terms": { "make": ["ford","honda"] } } ] }} , "aggs": { "group_by_make": {"terms": { "field": "make.keyword" },"aggs": { "group_by_color": {"terms": { "field": "color.keyword" } , "aggs": {"MTD" : {"filter" :{ "range" : { "sold" : { "gte" : "2014-01-01", "lte" : "2014-12-31" } } }  , "aggs" : {"cprice" : { "value_count" :{ "field" : "price"} },"sprice" : { "sum" :{ "field" : "price"} },"avgvalue1": {"avg": { "script" : {"inline": "doc['price'].value !=0 ? doc['price'].value/2 : 0" }}}}},"avgvalue2": {"bucket_script": { "buckets_path": {"formula_1_1" : "MTD.sprice","formula_2_1" : "MTD.cprice"},"script" : "params.formula_1_1 != 0 ? params.formula_2_1/params.formula_2_1 :0" }}} }} }}}} }
Caveat - I am not an elasticsearch expert. Elastic has provided a comprehensive document on features. You can get more details from their site. If you would like to download the sample data for these queries, please write a comment.

Solr Query: SQL Query

Apache Solr is an open source search platform. It can also be used for analysis of structured data. As a SQL developer, we always look for the equivalent of SQL Query on any other system. Just putting down how a SQL query will look on Solr. It is assumed that you have basic knowledge of Solr and SQL and able to install Solr. All the SQL queries are based on RDBMS. Though Solr is also supporting sql in latest release, we are using method to pass parameters in url
Table Structure used: Sales
city       Varchar , country      Varchar , Qty         Number , datadate   Date , Sales      Number ,Item      Varchar ,id         int, Category   Varchar , Cost      Number
All Solr query are executed on URL. A common URL for all queries is http://localhost:8983/solr/sales/select?indent=on {Indent is optional}
  1. Select All
SQL : Select * from Sales
Solr : ?q=*:*&wt=json {wt is response parameter that support csv,json,xml}
2. Selected Field
SQL : Select city , cost from sales
Solr : fl=city,cost&q=*:*&wt=json
3. Filters
SQL : Select * from sales where city = 'BOSTON' or Select * from sales where city = 'New York'
Solr : fq=city:BOSTON&q=*:*&wt=json or ?fq=city: "New York"&q=*:*&wt=json . If you do not give double quotes, it will search for both New and York as City =New or City = York. Use : in solr queries.
4. Filter with Date Range
SQL : select * from sales where city = 'BOSTON' and datadate between '2013-01-01' and '2013-12-31'. You might have to cast date in some of the databases.
Solr : fq=city: BOSTON &fq=datadate: ["2013-01-01" TO "2013-12-31"]&q=*:*&wt=json . fq need to used with each filter. Date based on supported formats
5. Group By: In Solr I prefer stats and facet to do this. In stats you need to mention measure and its Aggregations. You can rename measure and assign tag to create a group of measures. Tags are important to apply different filters on different combinations. To keep things simple, facet allows you to group the data. You can mention the group by columns.
SQL : Select city,sum(sales) sales from sales group by city
Solr : q=*&facet=true&stats=true&stats.field={!tag=a1 sum=true key=Sales}sales&facet.pivot={!key=Group1 stats=a1}city&rows=0&wt=json&indent=on
Here ! is only used with first property inside {}. Later we will also use it to name filter and exclude tag/s to get data of a period. Rows=0 will make sure that you get only aggregated data. If you do not keep rows =0. You will also get base data.
In Solr, we can likewise accomplish group by using below format
Solr : select?group.field=city&group=true&indent=on&q=*:*&wt=json
6. Formula , in v=case you want use a/b or a*b or a*2
SQL : Select city,sum(Qty*cost) sales from sales group by city
Solr : q=*&facet=true&stats=true&stats.field={!tag=a1 sum=true key=Sales func}mul(Qty,cost)&facet.pivot={!key=Group1 stats=a1}city&rows=0&wt=json
7. A complex Query : The SQL and Sol Query are not equivalent. In SQL, we will use union all to append three period data. In Solr we will get three different set of datas. In Solr we will use exclude filter property to achieve this
Sql : select city , sum(sales) sales_2015, sum(cost) cost_2015, null sales_2014 , null cost_2014 , null sales_2013 , null cost_2013 from sales where city = 'BOSTON' and datadate between '2015-01-01' and '2015-12-31' group by city union all
select city , null sales_2015, null cost_2015, sum(sales) sales_2014 , sum(cost) cost_2014 , null sales_2013 , null cost_2013 from sales where city = 'BOSTON' and datadate between '2014-01-01' and '2014-12-31' group by city
union all select city , null sales_2015, null cost_2015, null sales_2014 , null cost_2014 , sum(sales) sales_2013 , sum(cost) cost_2013 from sales where city = 'BOSTON' and datadate between '2014-01-01' and '2014-12-31' group by city

Solr: q=*&fq={!tag%3Df_city}city:("BOSTON")&fq={!tag=P2015}datadate:["2015-01-01"+TO "2015-12-31"]&fq={!tag=P2013}datadate:["2013-01-01" TO "2013-12-31"]&fq={!tag=P2014}datadate:["2014-01-01" TO "2014-12-31"]&facet=true&stats=true&stats.field={!tag=sP2015 sum=true key=Sales_1 ex=P2013,P2014}sales&stats.field={!tag=sP2015 sum=true key=cost_1 ex=P2013,P2014}cost&stats.field={!tag=sP2013 sum=true key=Sales_1 ex=P2015,P2014}sales&stats.field={!tag=sP2013 sum=true key=cost_1 ex=P2015,P2014}cost&stats.field={!tag=sP2014 sum=true key=Sales_1 ex=P2015,P2013}sales&stats.field={!tag=sP2014 sum=true key=cost_1 ex=P2015,P2013}cost&facet.pivot={!ex=P2013,P2014 key=gP2015 stats=sP2015 }city&facet.pivot={!ex=P2015,P2014 key=gP2013 stats=sP2013 }city&facet.pivot={!ex=P2015,P2013 key=gP2014 stats=sP2014 }city&rows=0&wt=json&indent=on
8. Open Item
How to do in solr Select city,sum(sales)/sum(qty) Avg_sales from sales group by city
I am not a Solr expert. There can be better solutions. Refer to Solr documentation.