Details
- Documentation
- Status: Open
- Trivial
- Resolution: Unresolved
- 3.1.2
- None
- None
Description
The S3 bucket used as an example in the "Integration with Cloud Infrastructures" document will be deleted on Jul 1, 2021: https://registry.opendata.aws/landsat-8/
The dataset will move to another bucket, but that bucket requires the `--request-payer requester` option, so users would have to pay the S3 costs: https://registry.opendata.aws/usgs-landsat/
So I think it is better to change the data source, like this:
https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022
I chose [NYC Taxi data|https://registry.opendata.aws/nyc-tlc-trip-records-pds/] as the example here.
Unlike the Landsat data, it is not compressed, but it is just an example, and there are several tutorials that use it with Spark (e.g. https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample).
Read test result
scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
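For anyone who wants to go one step past the read test above, here is a minimal spark-shell sketch that counts zones per borough from the same file. It rests on two assumptions that are not part of this ticket: that spark-shell was started with the hadoop-aws module on the classpath, and that the bucket accepts anonymous reads at the time you run it. If it does not, drop the credentials-provider line and use your normal AWS credentials. The column positions come from the header printed in the read test.

```scala
// Paste into spark-shell; `sc` is the SparkContext the shell provides.

// Assumption: anonymous access is allowed for this bucket. Set this before the
// first s3a:// access in the session so the S3A filesystem picks it up.
sc.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

// Path copied verbatim from the read test above.
val lines = sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv")

// The first line is the header: "LocationID","Borough","Zone","service_zone"
val header = lines.first()

// Naive split on commas with quotes stripped. Borough is the second field,
// so it is unaffected even if a later field happens to contain a comma.
val boroughs = lines
  .filter(_ != header)
  .map(line => line.split(",")(1).replace("\"", ""))

// Count the number of zones per borough and print the result on the driver.
boroughs
  .map(borough => (borough, 1))
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)
```

The sketch stays with `sc.textFile`, as in the read test, instead of a CSV reader, so it exercises the same code path that the documentation example relies on.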
Attachments
Issue Links
- relates to
- HADOOP-19057 S3 public test bucket landsat-pds unreadable -needs replacement (Resolved)
- HADOOP-17784 hadoop-aws landsat-pds test bucket will be deleted after Jul 1, 2021 (Resolved)