Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Activity
Commit 5fc93b607fa9b3b2d5dd359007721f79551a09d4 in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=5fc93b607f ]
SYSTEMDS-3696 Extended incremental SliceLine state handling
Closes #2116.
Commit 3a73b77e4187d51ded0d0a5b81d32d3a1f407156 in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=3a73b77e41 ]
SYSTEMDS-3696 Minor robustness fix and pruning flags
Closes #2107.
Commit 726d21d08aa417764123221e2f5ae95ff92bb4f9 in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=726d21d08a ]
SYSTEMDS-3696 Additional pruning strategy for incremental slice line
This patch adds a very effective pruning strategy which yields up to
two orders of magnitude runtime improvements on Adult, Covtype, KDD98,
and USCensus. However, this strategy only gives high-probability
guarantees. In detail, we evaluate previously evaluated slices by
adding the contribution of added and removed tuples in order to
determine feature-wise high-probability upper bound scores which are
in turn used to eliminate basic (single-feature) slices early on.
Due to edge cases that might be missed, this strategy should not be
applied by default (even though the tests pass). I will change the
default when handling #2107 because it also touches the pruning selector.
Commit a973b1107567cca27a4c23ec3e230e17f00f46e7 in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=a973b11075 ]
SYSTEMDS-3696 Fix edge cases incremental sliceline
Closes #2106.
Commit 95cfb76fee57e92d20c94b26b445d91819dfc5ee in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=95cfb76fee ]
SYSTEMDS-3696 Fix incremental SliceLine naming conflicts in namespace
This patch fixes the SliceLine builtin in order to allow joint use of
SliceLine and incSliceLine without any naming conflicts in the
.builtin namespace.
Commit c1e8500e0704e0f254799f0425ee50006920b7b3 in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=c1e8500e07 ]
SYSTEMDS-3696 Improved incremental slice line (pruning unchanged)
This patch improves the pruning of unchanged slices below min-support
with a more efficient selection and matching against enumerated slices.
Now, on Adult the first incSliceLine runs in 27s (similar to sliceLine)
but the second incSliceLine with few additional tuples runs in 3s.
Commit 472e69fb2ef7d0b30662d1ca313c1b25628f1a94 in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=472e69fb2e ]
SYSTEMDS-3696 Additional pruning in incremental slice line
Besides additional cleanups and smaller improvements, this patch adds
a new pruning strategy that computes, for unchanged slices, the
maximal reachable scores from previous runs, scales them according to
the new data size and average errors, and uses these scores to
prune all features whose maxsc is smaller than 0 or the scores of the
previous top-k set evaluated on the new data.
Commit 8b5d4cc2419b56877f0028e2d451c10d83327fdd in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=8b5d4cc241 ]
SYSTEMDS-3696 Fix incremental slice line (unchanged pruning)
Commit 254d680e465b3ccdf247878ef5f665ad12828daa in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=254d680e46 ]
SYSTEMDS-3696 Improve incremental slice line (cleanup, robustness)
- robust top-K maintenance for continuous score pruning
w/o special cases for minsc handling
- cleanup pruning strategies of basic input slices
- robustness for -Inf in previous top-K evaluation
- various vectorization of individual code snippets
- improved error handling (via stop)
Commit 5283544289b708e32756d5b145c01093fa032c4c in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=5283544289 ]
SYSTEMDS-3696 Extended incremental slice line (pruning selector)
Closes #2098.
Commit f4e53ba17a4147ecfacb10b0c905f09397d7545b in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=f4e53ba17a ]
SYSTEMDS-3696 Performance improvements incremental slice finding
This patch is a performance fix-pack for incremental SliceLine, which
improved its runtime from 90.4 to 52.2s on a particular scenario with
the Adult dataset. In detail, the modifications include:
- vectorized one-hot encoding: O(m^2*n^2) -> O(m*n)
- vectorized scoring of previous top-k set
- vectorized pruning of unchanged slices
- vectorized removal of deleted tuples: O(n^2) -> O
Furthermore, this patch also cleans up the wrong formatting (spaces
instead of tabs) of the incremental slice finder tests.
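The one-hot encoding item above can be illustrated with a small sketch. This is not the code from the patch (the builtin is written in DML); it is a plain-Python illustration that assumes the input is an integer-recoded matrix with values 1..d_j in column j, and the names `one_hot_encode`, `fdom`, and `foffb` are chosen here for illustration. The idea is to compute one column offset per feature up front, so that every input cell maps directly to its output column with a single write instead of being searched for or compared against all other cells.

```python
from itertools import accumulate


def one_hot_encode(X):
    """One-hot encode an integer-recoded matrix X (list of rows).

    Assumption: column j holds values in 1..d_j, where d_j is the
    column maximum. Returns a dense 0/1 matrix with sum(d_j) columns.
    """
    if not X:
        return []
    n = len(X[0])
    # Domain size per feature, taken as the column maximum.
    fdom = [max(row[j] for row in X) for j in range(n)]
    # End position of each feature block, then its start offset.
    ends = list(accumulate(fdom))
    foffb = [end - dom for end, dom in zip(ends, fdom)]
    total = ends[-1]
    encoded = []
    for row in X:
        out = [0] * total  # dense row; a sparse format would avoid this
        for j, value in enumerate(row):
            # Exactly one write per input cell: m*n writes overall.
            out[foffb[j] + value - 1] = 1
        encoded.append(out)
    return encoded
```

For example, `one_hot_encode([[1, 2], [2, 1]])` produces a matrix with four columns, two for each feature.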
Commit 4b2a3ca7823599f19be23aa41038e658cdd0ff4e in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=4b2a3ca782 ]
SYSTEMDS-3696 Improved incremental slice-line builtin
Closes #2063.
Commit 54d0a65145aa43338da4df55e75e6e1fa598e8e3 in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=54d0a65145 ]
SYSTEMDS-3696 Improved incremental SliceLine (previous stats)
Closes #2039.
Commit 9e99f3c4c3bec42299fa5e48a0cb3bc3aea264be in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=9e99f3c4c3 ]
SYSTEMDS-3696 New sliceLineDebug built-in function for usability
This patch adds a new sliceLineDebug function to present the top-k
worst slices returned from sliceLine (slicefinder) in a human-readable
format. This is the output for the Salaries dataset:
sliceLineDebug:
-- Slice #1: score=0.4041683676825298, size=248.0
---- avg error=6.558681888351787E8, max error=8.524558818262574E9
---- predicate: "rank" = "Prof" AND "sex" = "Male"
-- Slice #2: score=0.3731763935666855, size=42.0
---- avg error=8.271958572009121E8, max error=4.553584116646141E9
---- predicate: "rank" = "Prof" AND "yrs.since.phd" = 31.25
-- Slice #3: score=0.3675193573989536, size=125.0
---- avg error=6.758211389786526E8, max error=8.524558818262574E9
---- predicate: "rank" = "Prof" AND "discipline" = "B" AND "sex" = "Male"
-- Slice #4: score=0.35652331744984933, size=266.0
---- avg error=6.307265846260264E8, max error=8.524558818262574E9
---- predicate: "rank" = "Prof"
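To help read the score values in this output, here is a minimal sketch of the slice scoring function as it is defined in the SliceLine paper, not taken from this issue or from the builtin's source. The function name and parameter names are hypothetical, and `alpha` must be supplied by the caller because the value used for the run above is not stated here. The score rewards slices whose average error exceeds the overall average error and penalizes small slices.

```python
def slice_score(slice_size, slice_error_sum, n, total_error_sum, alpha):
    """Score of a slice, following the SliceLine paper's definition:

        alpha * (avg_slice_error / avg_error - 1)
          - (1 - alpha) * (n / slice_size - 1)

    Assumptions: slice_size > 0, n > 0, total_error_sum > 0, and
    0 < alpha <= 1. All names are illustrative, not the builtin's API.
    """
    avg_slice_error = slice_error_sum / slice_size
    avg_error = total_error_sum / n
    error_term = avg_slice_error / avg_error - 1  # > 0 if worse than average
    size_term = n / slice_size - 1                # 0 only for the full dataset
    return alpha * error_term - (1 - alpha) * size_term
```

Under this definition the slice that covers the entire dataset scores exactly 0, so a positive score means a slice is worse than average by more than its size penalty.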
Commit 5ec8d0c06a99cdf1250d2d85c6dbc8e43e84ea19 in systemds's branch refs/heads/main from Frederic Zoepffel
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=5ec8d0c06a ]
SYSTEMDS-3696 Basic incremental slice-line builtin, and tests
Closes #2024.
Commit 827438b953ed2bc6f6f4cc30a34df50231bea050 in systemds's branch refs/heads/main from Matthias Boehm
[ https://gitbox.apache.org/repos/asf?p=systemds.git;h=827438b953 ]
SYSTEMDS-3696 Fix incSliceLine flag for pruning strategies
Recent experiments revealed that even with disabled pruning strategies
incSliceLine was still faster than sliceLine on some datasets because
the reevaluated top-K set was passed to the current top-K set from the
beginning and thus used for additional score pruning. We now prevent
this if score pruning is disabled.
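The effect described in this commit message can be sketched as follows. This is a hedged illustration under stated assumptions, not the builtin's code: the function names, the `(slice_id, score)` tuple layout, and the rule that only positive scores qualify for the top-K are assumptions made here. Seeding the current top-K with the re-evaluated previous top-K raises the pruning threshold from the start, so the seed has to be skipped when score pruning is disabled.

```python
def seed_top_k(prev_top_k, k, score_pruning_enabled):
    """Return the initial top-K for the current run.

    prev_top_k: list of (slice_id, score) pairs, already re-evaluated
    on the new data (hypothetical layout).
    """
    if not score_pruning_enabled:
        # No seed: the enumeration must not benefit from old results.
        return []
    # Assumption: only slices with a positive score qualify.
    valid = [entry for entry in prev_top_k if entry[1] > 0]
    return sorted(valid, key=lambda entry: entry[1], reverse=True)[:k]


def pruning_threshold(top_k, k):
    """Minimum score a candidate must exceed to enter the top-K."""
    if len(top_k) < k:
        return 0.0  # top-K not full yet: only non-positive scores are out
    return max(0.0, min(score for _, score in top_k))
```

With a seeded top-K the threshold is already positive before any new slice is evaluated, which is the additional pruning the message refers to; with an empty seed it starts at 0.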