[SPARK-36024] Switch the datasource example due to the deprecation of the dataset - ASF JIRA

Details

Type: Documentation
Status: Open
Priority: Trivial
Resolution: Unresolved
Affects Version/s: 3.1.2
Fix Version/s: None
Component/s: Documentation
Labels: None

Description

The S3 bucket that is used as an example in the "Integration with Cloud Infrastructures" document will be deleted on Jul 1, 2021: https://registry.opendata.aws/landsat-8/

The dataset will move to another bucket, but that bucket requires the `--request-payer requester` option, so users would have to pay the S3 costs: https://registry.opendata.aws/usgs-landsat/

So I think it would be better to change the data source, as in this commit:

https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022

I chose [NYC Taxi data|https://registry.opendata.aws/nyc-tlc-trip-records-pds/] as the example here.
Unlike the Landsat data it is not compressed, but it is only an example, and several tutorials already use it with Spark (e.g. https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample).

Read test result

scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
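
The read test above prints raw CSV lines, with every text field wrapped in double quotes. As a minimal sketch of how those lines could be turned into typed records, the block below parses one line at a time in plain Scala, with no Spark dependency. The names `TaxiZone`, `TaxiZoneParser`, `splitCsv`, `parse`, and `parseAll` are hypothetical and are not part of the proposed documentation change. It assumes fields never contain escaped quotes, which holds for the rows shown in the test output.

```scala
// Hypothetical record type for one row of the zone lookup file.
final case class TaxiZone(locationId: Int, borough: String, zone: String, serviceZone: String)

object TaxiZoneParser {
  // Splits a line on commas that sit outside double quotes and drops the quotes.
  // Assumption: fields contain no escaped quotes, as in the sample rows.
  def splitCsv(line: String): Seq[String] = {
    val fields = Seq.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    line.foreach {
      case '"' => inQuotes = !inQuotes
      case ',' if !inQuotes =>
        fields += current.toString
        current.clear()
      case c => current.append(c)
    }
    fields += current.toString
    fields.result()
  }

  // Returns None for the header row and for any line that does not have
  // exactly four fields with an all-digit first field.
  def parse(line: String): Option[TaxiZone] =
    splitCsv(line) match {
      case Seq(id, borough, zone, serviceZone) if id.nonEmpty && id.forall(_.isDigit) =>
        Some(TaxiZone(id.toInt, borough, zone, serviceZone))
      case _ => None
    }

  // Parses many lines, silently skipping the header and malformed rows.
  def parseAll(lines: Seq[String]): Seq[TaxiZone] =
    lines.flatMap(line => parse(line))
}
```

Because `parse` returns an `Option`, the header row is dropped without special-casing its position. The same function could in principle be applied to each line returned by `sc.textFile`, though that has not been tested against the bucket here.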

Issue Links

relates to

HADOOP-19057 S3 public test bucket landsat-pds unreadable -needs replacement

Resolved

HADOOP-17784 hadoop-aws landsat-pds test bucket will be deleted after Jul 1, 2021

Resolved

People

Assignee: Unassigned

Reporter: Leona Yoda

Votes: 0

Watchers: 3

Dates

Created: 06/Jul/21 05:01

Updated: 31/Jan/24 15:59