SystemDS / SYSTEMDS-1277

DataFrames With `mllib.Vector` Columns Are No Longer Converted to Matrices.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: SystemML 0.13
    • Fix Version/s: SystemML 0.13
    • Component/s: None
    • Labels: None

    Description

      Recently, we switched from the old mllib.Vector type to the new ml.Vector type. As a result, DataFrames with mllib.Vector columns are no longer recognized during conversion, so we (1) do not convert them to SystemML Matrix objects, (2) instead fall back on conversion to Frame objects, and then (3) fail completely when the ensuing DML script expects to operate on matrices.

      Given a Spark DataFrame X_df of type DataFrame[__INDEX: int, sample: vector], where vector is of type mllib.Vector, the following script now fails (it did not previously):

      script = """
      # Scale images to [-1,1]
      X = X / 255
      X = X * 2 - 1
      """
      outputs = ("X",)  # one-element tuple of output names
      script = dml(script).input(X=X_df).output(*outputs)
      X = ml.execute(script).get(*outputs)
      X
      
      Caused by: org.apache.sysml.api.mlcontext.MLContextException: Exception occurred while validating script
      	at org.apache.sysml.api.mlcontext.ScriptExecutor.validateScript(ScriptExecutor.java:487)
      	at org.apache.sysml.api.mlcontext.ScriptExecutor.execute(ScriptExecutor.java:280)
      	at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:293)
      	... 12 more
      Caused by: org.apache.sysml.parser.LanguageException: Invalid Parameters : ERROR: null -- line 4, column 4 -- Invalid Datatypes for operation FRAME SCALAR
      	at org.apache.sysml.parser.Expression.raiseValidateError(Expression.java:549)
      	at org.apache.sysml.parser.Expression.computeDataType(Expression.java:415)
      	at org.apache.sysml.parser.Expression.computeDataType(Expression.java:386)
      	at org.apache.sysml.parser.BinaryExpression.validateExpression(BinaryExpression.java:130)
      	at org.apache.sysml.parser.StatementBlock.validate(StatementBlock.java:567)
      	at org.apache.sysml.parser.DMLTranslator.validateParseTree(DMLTranslator.java:140)
      	at org.apache.sysml.api.mlcontext.ScriptExecutor.validateScript(ScriptExecutor.java:485)
      	... 14 more
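One way to confirm that a DataFrame is affected is to inspect its schema for the legacy vector type. The following is a minimal sketch, not part of SystemML: the helper name `find_mllib_vector_columns` is hypothetical, and it assumes the schema is supplied as the dict returned by `X_df.schema.jsonValue()`, in which a user-defined type appears as a dict carrying the Scala class name. The example schema is hand-written and its `sqlType` entry is abbreviated.

```python
# Fully qualified Scala class name recorded for the legacy vector type
# (assumed here to be what PySpark reports for mllib vector columns).
MLLIB_VECTOR_UDT = "org.apache.spark.mllib.linalg.VectorUDT"


def find_mllib_vector_columns(schema_json):
    """Return names of top-level columns typed as the legacy mllib VectorUDT.

    `schema_json` is a dict shaped like the result of `df.schema.jsonValue()`.
    """
    legacy = []
    for field in schema_json.get("fields", []):
        dtype = field.get("type")
        # Primitive types are plain strings ("integer"); UDTs are dicts.
        if isinstance(dtype, dict) and dtype.get("type") == "udt":
            if dtype.get("class") == MLLIB_VECTOR_UDT:
                legacy.append(field["name"])
    return legacy


# Hand-written stand-in for DataFrame[__INDEX: int, sample: vector];
# with a live DataFrame this would be X_df.schema.jsonValue().
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "__INDEX", "type": "integer", "nullable": True, "metadata": {}},
        {
            "name": "sample",
            "type": {
                "type": "udt",
                "class": "org.apache.spark.mllib.linalg.VectorUDT",
                "pyClass": "pyspark.mllib.linalg.VectorUDT",
                "sqlType": {"type": "struct", "fields": []},  # abbreviated
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

print(find_mllib_vector_columns(example_schema))
```

If the list is non-empty, the DataFrame still carries mllib.Vector columns and would take the Frame fallback path described above.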
      

        Activity

          dusenberrymw Mike Dusenberry added a comment - This fixed my real-world case. Thanks, deron!

          deron Jon Deron Eriksson added a comment - This is addressed by PR397.

          mwdusenb@us.ibm.com Could you resolve this issue if it works with your real-world data example?

          xwu0226 Mike hit this issue while working on the SystemML Breast Cancer project, which involves deep learning. See PR347. We recently updated SystemML from mllib.Vector to the newer ml.Vector. The fix is simply to support both formats.

          xwu0226 Xin Wu added a comment - Is this issue also for Deep Learning?

          dusenberrymw Mike Dusenberry added a comment - Adding the following fixes the issue, so we should just add similar wrappers at the Java MLContext layer.

          from pyspark.mllib.util import MLUtils

          # Convert DataFrame columns of type `mllib.Vector` to type `ml.Vector`
          X_df = MLUtils.convertVectorColumnsToML(X_df)

          dusenberrymw Mike Dusenberry added a comment - Update: Here's the official word on DataFrame conversions from the old mllib.Vector to ml.Vector: https://spark.apache.org/docs/2.0.0/ml-guide.html#breaking-changes

          dusenberrymw Mike Dusenberry added a comment - Also, just to follow up, the ml.Vector type should remain the standard default, as Spark is moving away from mllib.Vector. However, DataFrames created and saved with mllib.Vector types can still be used, often without the user realizing that a saved DataFrame maintains a distinct separation between the two types. It's therefore plausible that a user will try to run the same SystemML code with the same DataFrame as before and run into issues now. We could just catch any mllib.Vector types and convert them to ml.Vector with mllib.Vector.asML, which does not copy the data; see https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector
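The "catch and convert" idea in this comment can be sketched at the level of a single value. This is an illustration only, not the change made in the referenced PR: the helper name `to_ml_vector` is hypothetical, and it assumes that legacy vectors expose an `asML()` method (as the `pyspark.mllib.linalg` vector classes do) while `ml.Vector` values do not.

```python
def to_ml_vector(value):
    """Return `value` converted to an ml.Vector if it is a legacy mllib.Vector.

    Detection is duck-typed: anything exposing a callable `asML` is treated
    as a legacy vector; every other value is returned unchanged.
    """
    as_ml = getattr(value, "asML", None)
    if callable(as_ml):
        return as_ml()
    return value
```

Accepting both types at the conversion boundary means DataFrames saved before the switch keep working, while ml.Vector stays the default everywhere else.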

          dusenberrymw Mike Dusenberry added a comment - cc deron

          People

            Assignee: deron Jon Deron Eriksson
            Reporter: dusenberrymw Mike Dusenberry
            Votes: 0
            Watchers: 3