This bounty is no longer available
Wikipedia on Swarm

Organization: fairdatasociety
Deadline: about 2 years ago
Status: ENDED

50000 USD

INSTRUCTIONS

Motivation:

  • E.T. wants to access Wikipedia, but the site is no longer accessible on the web.
  • They want to find a way to download the entire Wikipedia and access it offline, but this download requires a lot of traffic (500 GB of wiki mirror) and is difficult to update - there surely must be an easier, faster and more efficient way to access Wikipedia.
  • Fortunately, the Swarm network hosts a mirror of Wikipedia, so anyone connected to it can access and search through the complete set of information.
  • Wikipedia on Swarm is updated periodically (at minimum on a monthly basis) and is always accessible through the same link and the same means.

Goals:

  • Maintain a mirror of English Wikipedia on Swarm, complying with all necessary licences. It should be able to handle non-Latin alphabet characters (uploading languages like Spanish, Russian, Czech, Arabic, Farsi, Korean, Japanese and Chinese).
  • Create a reusable solution that provides broader utility - components can be reused to upload large collections of small files like other ZIM archives (e.g. Project Gutenberg) or OpenStreetMap data.
  • Create or modify a web interface and/or an app to allow searching and reading of Swarm hosted content.
  • Anyone with a devops background should be able to run the solution. Nonetheless, we expect the winner(s) of the bounty to run and maintain the tools (and they may qualify for a Fellowship in return). The solution needs to be open source and well documented.

Note: Hosting of a database of this scale on Swarm has not been efficiently automated yet. As of today, the Bee client can reliably upload and retrieve small files. For larger datasets, an efficient mechanism for upload should be implemented.

Technical requirements:

  • Create a pipeline built from a set of independent components that observes Wikipedia dumps and uploads them to the Swarm network.
  • The design of the interfaces as well as the actual modularisation between these components is up to you. Below is a suggested pipeline. The only component we would like to always keep separate is the Uploader.
      +---------+    +------------+    +-----------+    +----------+    +----------+
      | Trigger | -> | Downloader | -> | Extractor | -> | Enhancer | -> | Uploader |
      +---------+    +------------+    +-----------+    +----------+    +----------+
  • Trigger: triggers the upload
    • It either watches the repository with ZIM archives or, through some other means, triggers the build periodically as new versions of ZIM are released.
    • A good place to download these archives from is https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/
  • Downloader: downloads a specific ZIM archive
    • Any optimisations such as checksum validation or some form of streaming into the next step in the pipeline would be appreciated but not required.
  • Extractor: extracts the archive
    • Here you can get inspired by the great submissions to the We Are Millions hackathon.
  • Enhancer (optional but recommended)
    • A content-aware step that enriches the files with additional information such as the checksum of the uploaded ZIM file, the date it was released, the ENS name, etc. This step should, however, be deterministic and replicable.
    • It can also add UX features like search mechanism and any such addition will be considered for the bounty.
  • Uploader: upload the archive content to Swarm network.
    • A reliable mechanism for uploads of large datasets should be created.
    • One approach is to upload the files individually and then create a custom manifest where you link all the files together. You can get inspired by EdgarBarrantes/swarm-zim-uploader, but feel free to experiment with new approaches as well.
    • The Uploader should handle any errors or problems that may occur and continue uploading.
    • The mechanism should support streaming of the content to upload (and should start uploading right away) as well as just mounting a resource and uploading that.
    • The output of this step is a Swarm hash (can of course output additional data).
    • Handling of postage stamps, etc. can be done through gateway-proxy or built into the solution.
    • This component needs to be very much agnostic and be able to upload any collection of small files and nested directories.
  • Feeds and ENS (bonus)
    • The resulting Swarm hash could and most likely should be stored in a feed.
    • This feed is separately stamped and re-uploaded to ensure it does not disappear from the Swarm network.
    • The hash should be updated each time in an ENS record or this ENS record should store the feed. The design is up to you and will be considered when awarding the bounty.
  • The solution should consist of at least two independent Docker containers (one of them being the Uploader) that can be chained one after the other with clear interfaces between them. Splitting it into smaller services that each do one thing would be appreciated.
  • In addition, any improvements such as a performant decentralised search mechanism using Swarm or Swarm and external decentralised services would be welcomed.
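
The Uploader requirements above (upload files individually, keep going after errors, link the results in a manifest) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the bounty's reference design: `upload_tree`, `manifest_bytes` and the `upload_one` callback are hypothetical names, `upload_one` stands in for whatever call actually sends bytes to a Bee node and returns a Swarm reference, and the flat JSON mapping is only a stand-in for a real Swarm manifest.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def upload_tree(
    root: Path,
    upload_one: Callable[[bytes], str],  # hypothetical: sends bytes, returns a reference
    max_attempts: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Upload every file under root individually.

    Returns (manifest, failed): manifest maps relative paths to the
    reference returned by upload_one; failed lists the paths that
    still raised an error after max_attempts tries.
    """
    manifest: Dict[str, str] = {}
    failed: List[str] = []
    # Sort so that repeated runs visit files in the same order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        data = path.read_bytes()
        for attempt in range(1, max_attempts + 1):
            try:
                manifest[rel] = upload_one(data)
                break
            except Exception:
                if attempt == max_attempts:
                    # Record the failure and continue with the next file.
                    failed.append(rel)
    return manifest, failed


def manifest_bytes(manifest: Dict[str, str]) -> bytes:
    """Serialise with sorted keys so equal manifests give equal bytes.

    ensure_ascii=False keeps non-Latin file names readable in the output.
    """
    return json.dumps(manifest, sort_keys=True, ensure_ascii=False).encode("utf-8")
```

Passing `upload_one` in as a parameter keeps the sketch agnostic about postage stamps and gateways. Paths that still fail after `max_attempts` are reported instead of aborting the run, so a later pass can retry only those. The retries here are immediate; a real uploader would probably wait between attempts.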

Assessment Criteria:

Internal:

  • Can the solution upload full English Wikipedia?
  • Is there an uploaded version of English Wikipedia and other languages?
  • Can we run the product on an AWS Linux machine using the documentation provided?
  • What is the final product’s user experience? Does it have a search mechanism? How does it perform? We’d love the bounty to invite any innovation.
  • Does it implement any optimisations? E.g. uploading only what changed, advanced stamp management, restamping chunks, reuploading missing chunks, streaming…
  • How much time/resources does it consume?
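
One of the optimisations named above, uploading only what changed, could be approached by comparing content digests between two extracted dumps. The sketch below is only an illustration of that idea: `snapshot`, `diff` and `Delta` are hypothetical names, and SHA-256 over file contents is an assumed choice of digest, not something the bounty prescribes.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, NamedTuple


class Delta(NamedTuple):
    added: List[str]    # paths present only in the new dump
    changed: List[str]  # paths present in both, with different content
    removed: List[str]  # paths present only in the old dump


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    digests: Dict[str, str] = {}
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff(previous: Dict[str, str], current: Dict[str, str]) -> Delta:
    """Compare two snapshots; only added and changed files need uploading."""
    added = sorted(p for p in current if p not in previous)
    changed = sorted(
        p for p in current if p in previous and previous[p] != current[p]
    )
    removed = sorted(p for p in previous if p not in current)
    return Delta(added, changed, removed)
```

With last month's snapshot stored next to the manifest, an uploader would only need to send `added` and `changed`, and could reuse the existing references for everything else.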

External:

  • Meeting the listed requirements.
  • Swarm hash pointing to a full English Wikipedia uploaded to Swarm through your solution.
  • Quality of implementation: code, documentation, technical excellence.
  • Quality of user experience (with regard to uploading and, more importantly, with regard to using the uploaded Wikipedia).
  • How innovative is your solution, i.e. does it have some additional features like search mechanism or a mechanism ensuring that the content is available (on demand reuploading or global pinning)?
  • Technical complexity and optimisations.

Code of Conduct:

Let’s build exciting things.

Projects need to support privacy, data interoperability and data sovereignty where applicable. For more details, get familiar with Fair Data Society's principles.

Prize challenge:

This bounty has a total value of 50k DAI, to be disbursed in BZZ at the BZZ price on the day of the payout, as determined by Swarm Association. The prize can be awarded to a single project or split among several, as the judging committee deems most appropriate. The winner of the first prize will receive at least 20k DAI. The remaining 30k DAI will be distributed to the winner of the first prize or to other winners, up to 5th place, as the judging committee deems most appropriate. If, according to the judging committee, no project matches all the criteria mentioned above, no prizes will be awarded and the deadline might be extended.

The most promising projects may also be contacted for a Fellowship.

Submission requirements:

A final delivery will include:

  • Open source code licence:
    • any licence for the project shall include terms substantially similar to those of version 1.9 of the Open Source Definition, promulgated by the Open Source Initiative at https://opensource.org/osd-annotated.
  • Documentation: how to run, how it works
  • Recorded video demonstration of the working solution
  • A working link to a public GitHub or GitLab repository containing: the code, presentations, demo, documentation, and licence information
  • The submitted solution must be a working product ready to be deployed in production
  • The solution uses Swarm network as the underlying storage

The judging panel will attempt to retrieve several random pages to ensure that English Wikipedia has been uploaded to Swarm mainnet as required for this bounty. Projects that do not pass this test will be discarded.
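
The panel's exact procedure is not described. A submitter who wants to run a similar check before submitting could sample pages from their own path-to-reference mapping and try to fetch each one. The sketch below assumes such a mapping exists; `spot_check` and the `fetch` callback are hypothetical names, with `fetch` standing in for whatever retrieves content from Swarm by reference.

```python
import random
from typing import Callable, Dict, List, Optional


def spot_check(
    manifest: Dict[str, str],
    fetch: Callable[[str], bytes],  # hypothetical: reference in, content out
    sample_size: int = 5,
    seed: Optional[int] = None,
) -> List[str]:
    """Fetch a random sample of pages and return the paths that failed.

    A page counts as failed if fetch raises or returns empty content.
    """
    rng = random.Random(seed)
    paths = sorted(manifest)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    missing: List[str] = []
    for path in sample:
        try:
            if not fetch(manifest[path]):
                missing.append(path)
        except Exception:
            missing.append(path)
    return missing
```

An empty result only says that the sampled pages were retrievable at that moment; it does not show that the whole upload is complete.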

Eligibility:

Employees, contractors, or officers of Swarm Association and their affiliates are not eligible to participate in the bounty.

Participants can register as a team or as individuals. Participants can either join other teams or work alone. We believe in collaboration and encourage participants to work together.

Timeline:

The deadline for submitting your project is August 31st 2022 at 16:00 CET.

Winners will be announced within 30 days of the deadline.

Important Links:

Swarm Gateway White Paper Bee Documentation Swarm Discord FDS Github FDS Discord

No Liability:

The participant acknowledges and agrees that, to the fullest extent permitted by law, he/she will not hold Swarm liable for any and all damages or injury whatsoever caused by or related to his/her participation to the bounty under any cause or action whatsoever of any kind, including, without limitation, actions for breach of warranty, breach of contract or tort (including negligence) and that Swarm shall not be liable for any indirect, incidental, special, exemplary or consequential damages, including for loss of profits, goodwill or data, in any way whatsoever arising out of the participation to the bounty.

Governing Law and Jurisdiction:

These terms, as well as all matters arising out of or in relation to them, shall be governed by the laws of Switzerland, to the exclusion of the rules on conflicts of laws.

Any claim or dispute regarding these terms or in relation to them shall be subject to the exclusive jurisdiction of the Courts of Neuchâtel, Switzerland, subject to an appeal at the Swiss Federal Court.