Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: None
Description
A common practice is to use subdirectories below a main directory as a partitioning device. Say you have a table named "myawesomedata" and new data arrives in that table every day. It is then valuable to create the main directory with one subdirectory per day, so that queries against only certain days of data can be optimized:
/myawesomedata/
/myawesomedata/2016-02-01
/myawesomedata/2016-02-02
/myawesomedata/2016-02-03
/myawesomedata/2016-02-04
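For anyone trying to reproduce this, the layout above can be built locally with a short script. This is only a sketch: the `build_layout` helper, the `data.csv` file name, the CSV contents, and the use of a temporary local directory are my assumptions for illustration, not details of the original data set.

```python
import tempfile
from pathlib import Path


def build_layout(root, days, rows_per_day=3):
    """Create root/<day>/data.csv for each day (hypothetical file name and contents)."""
    root = Path(root)
    for day in days:
        day_dir = root / day
        day_dir.mkdir(parents=True, exist_ok=True)
        # Each row is "<day>,<row index>", which is enough to give count(1) something to count.
        lines = [f"{day},{i}" for i in range(rows_per_day)]
        (day_dir / "data.csv").write_text("\n".join(lines) + "\n")
    # Return the subdirectory names, i.e. the values dir0 would take.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp()) / "myawesomedata"
    days = ["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-04"]
    print(build_layout(base, days))
```

Pointing a Drill file-system workspace at the parent of the generated directory should then make `myawesomedata` queryable; the workspace configuration itself is not shown here.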
I have identified a condition where, if there is ONLY one subdirectory, a query that filters on that subdirectory does not return the results a user would expect.
Example:
With the layout above, if I run:
select count(1) from `myawesomedata`;
I get an accurate count across all subdirectories.
If I run:
select count(1) from `myawesomedata` where dir0 = '2016-02-01';
I get an accurate count for only the subdirectory 2016-02-01.
However, if I delete subdirectories 2016-02-02, 2016-02-03, and 2016-02-04, I am left with:
/myawesomedata/
/myawesomedata/2016-02-01
Then, if I run:
select count(1) from `myawesomedata`;
it returns the accurate count (which is just that of the 2016-02-01 directory).
However, if I run:
select count(1) from `myawesomedata` where dir0 = '2016-02-01';
it takes much longer (about 15 seconds, versus near-instant for the other queries) and returns no results, even though this is the same query that worked when there were two or more subdirectories. In other words, when there is only one subdirectory, a query filtering on that directory does not behave the way it does when there are several. This is unexpected, and I believe it could cause user frustration and incorrect results when using Drill on partitioned data.
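To make the expected behavior concrete, here is a small reference model in plain Python. It is not Drill code and says nothing about how Drill evaluates `dir0` internally; `expected_count` and `make_table` are hypothetical helpers, and the generated files are made-up sample data. It only encodes what I expect: filtering on the name of a first-level subdirectory should give the same count for that directory whether it has siblings or not.

```python
import tempfile
from pathlib import Path


def expected_count(root, dir0=None):
    """Count data lines under root, optionally only in the first-level subdirectory named dir0."""
    total = 0
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        if dir0 is not None and sub.name != dir0:
            continue
        for f in sub.iterdir():
            if f.is_file():
                with f.open() as fh:
                    total += sum(1 for _ in fh)
    return total


def make_table(days, rows=3):
    """Build a throwaway myawesomedata directory with one small file per day."""
    root = Path(tempfile.mkdtemp()) / "myawesomedata"
    for day in days:
        (root / day).mkdir(parents=True)
        (root / day / "data.csv").write_text(
            "".join(f"{day},{i}\n" for i in range(rows))
        )
    return root


many = make_table(["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-04"])
one = make_table(["2016-02-01"])

print(expected_count(many), expected_count(many, "2016-02-01"))  # 12 3
print(expected_count(one), expected_count(one, "2016-02-01"))    # 3 3
```

In the single-subdirectory case the filtered and unfiltered counts should match; what I actually see from Drill is the filtered query returning nothing.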