DataScraperX 1000: Simplified Data Extraction

DataScraperX 1000: Appwrite Hashnode Hackathon

🤔 Team Details

Description:

DataScraperX 1000 is a data extraction tool built specifically for image retrieval. Its main strength is a deliberately simple user interface that makes the extraction process easy to follow. (This user-friendly design is a direct result of my lack of design and frontend creativity.)

Here's how the process typically works with DataScraperX 1000:

  1. Topic Entry: Users start by entering the specific topic or keyword they want to retrieve images for. This could be anything from "beach landscapes" to "modern architecture" or "hot dog."

  2. Source Selection: DataScraperX 1000 provides a list of available sources from which images can be extracted. These sources may include popular search engines, social media platforms, image-hosting websites, or specific image databases. Users can select the desired sources based on their preferences and requirements.

  3. Initiation of Extraction: After that, users can start the image extraction process. DataScraperX 1000 then uses its "advanced search algorithms" (that is, the APIs of the different sources plus web scraping) to crawl the selected sources and search for images relevant to the specified topic.

  4. Export and Usage: Once the desired images are identified, users can download them for further use.
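
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the `SOURCES` registry, the two `fake_*` fetchers, and the `example.com` URLs are all hypothetical stand-ins for real source APIs or scrapers.

```python
from typing import Callable, Dict, Iterable, List

# A fetcher takes a topic and returns image URLs. Both fetchers below are
# hypothetical stand-ins; real ones would call a source's API or scrape it.
Fetcher = Callable[[str], List[str]]


def fake_search_engine(topic: str) -> List[str]:
    slug = topic.replace(" ", "-")
    return [
        f"https://example.com/search/{slug}/1.jpg",
        f"https://example.com/shared/{slug}.jpg",
    ]


def fake_image_host(topic: str) -> List[str]:
    slug = topic.replace(" ", "-")
    return [
        f"https://example.com/shared/{slug}.jpg",
        f"https://example.com/host/{slug}/7.png",
    ]


# Step 2: the sources a user can pick from (assumed names).
SOURCES: Dict[str, Fetcher] = {
    "search_engine": fake_search_engine,
    "image_host": fake_image_host,
}


def extract_image_urls(topic: str, selected: Iterable[str]) -> List[str]:
    """Steps 1-3: run every selected source for the topic and merge results."""
    topic = topic.strip()
    if not topic:
        raise ValueError("topic must not be empty")
    seen = set()
    urls: List[str] = []
    for name in selected:
        fetcher = SOURCES.get(name)
        if fetcher is None:
            raise KeyError(f"unknown source: {name}")
        for url in fetcher(topic):
            if url not in seen:  # skip duplicates found by several sources
                seen.add(url)
                urls.append(url)
    return urls


if __name__ == "__main__":
    # Step 4 would download these; here we only print them.
    for url in extract_image_urls("hot dog", ["search_engine", "image_host"]):
        print(url)
```

Keeping each source behind the same small function signature means a source that breaks (for example after an API change) can be dropped from the registry without touching the rest of the flow.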

The motivation for this project was my interest in trying out machine learning for image classification. As I went through several online resources, I noticed that they consistently emphasized how much the data matters in ML image classification. So I built this project to simplify the process of obtaining relevant images.

🧰 Tech Stack

Technologies:

  • Appwrite | Backend taking care of Functions, Databases, and File Storage 🗃️

  • Next.js | Rendering framework for the React ecosystem, plus routing 🤖

  • TypeScript | Programming language to keep the website strongly typed

  • Python | Programming language for its simplicity, readability, and ecosystem

  • Chakra UI | UI component library that gives you the building blocks for a nice-looking interface

  • Auth UI | An authentication wrapper for Appwrite

Production:

💢 Challenges I Faced

For now, at least for me, Auth UI works only in the Chrome browser.

Adding images to the bucket in parallel.
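
One way to approach parallel uploads is a thread pool that records which files failed instead of stopping at the first error. The sketch below is an assumption about how this could be done, not the code used in this project: `upload_all` and `fake_upload` are hypothetical names, and `upload_one` stands in for whatever storage call actually puts a file into the bucket.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def upload_all(
    items: Dict[str, bytes],
    upload_one: Callable[[str, bytes], None],
    max_workers: int = 8,
) -> Tuple[List[str], Dict[str, str]]:
    """Upload every item in parallel; return (uploaded names, failures)."""
    uploaded: List[str] = []
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its file name so errors can be reported.
        futures = {
            pool.submit(upload_one, name, data): name
            for name, data in items.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                uploaded.append(name)
            except Exception as exc:
                failed[name] = str(exc)
    return sorted(uploaded), failed


if __name__ == "__main__":
    def fake_upload(name: str, data: bytes) -> None:
        # Hypothetical stand-in for a real storage call.
        if not data:
            raise ValueError("empty file")

    ok, errors = upload_all({"a.jpg": b"\xff\xd8", "b.jpg": b""}, fake_upload)
    print(ok, errors)
```

Threads suit this kind of work because uploads spend most of their time waiting on the network. Capping `max_workers` also keeps the number of simultaneous requests to the backend bounded.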

Relative module imports not working properly.

Subreddits went down on 12 June because of the API changes, and I had started this project a week before that without knowing the details.

Cloud error logs not showing a failed function.

Downloading files in parallel from the bucket to the client.

In my code, I implemented parallel execution of different functions to reduce processing time. However, I was uncertain about how the Python interpreter handles concurrent executions and the order in which the functions would run.
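
A small experiment helps with the ordering question. With `concurrent.futures` from the standard library, `Executor.map()` returns results in the order of its inputs, while `as_completed()` yields futures in whatever order they finish, which depends on timing. The sketch below assumes thread-based parallelism; `slow_square` and the delays are made up purely for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(n: int, delay: float) -> int:
    """Pretend to do slow I/O, then return n squared."""
    time.sleep(delay)
    return n * n


numbers = [1, 2, 3]
delays = [0.03, 0.02, 0.01]  # the first job is deliberately the slowest

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in input order, no matter which job finishes first.
    in_input_order = list(pool.map(slow_square, numbers, delays))

    # as_completed() yields futures as they finish, so this order can vary.
    futures = [pool.submit(slow_square, n, d) for n, d in zip(numbers, delays)]
    in_finish_order = [f.result() for f in as_completed(futures)]

print("map():", in_input_order)
print("as_completed():", in_finish_order)
```

In standard CPython, the global interpreter lock lets only one thread execute Python bytecode at a time, but it is released while a thread waits on I/O such as a network request. That is why threads still cut the total time for download-heavy work even though the functions do not truly run simultaneously.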

Not being able to use middleware in Next.js for protected routes; I am just using context for now.

Getting the states (isLoading, error, onSuccess, etc.) for client-side Appwrite CRUD operations.

🔗 Public Code Repo

GitHub: AnujSsStw/data_scrap

Live Demo: ⚒️ DataScraperX 1000 😶‍🌫️

#appwrite #AppwriteHackathon

Note: The generation step is going to take time if the topic is new.