Job Responsibilities -
• Architect and implement a scalable, offline Data Lake for structured, semi-structured, and unstructured data in an on-premises, air-gapped environment.
• Collaborate with Data Engineers, Factory IT, and Edge Device teams to enable seamless data ingestion and retrieval across the platform.
• Integrate with upstream systems such as MES, SCADA, and process tools to capture high-frequency manufacturing data efficiently.
• Monitor and maintain system health, including compute resources, storage arrays, disk I/O, memory usage, and network throughput.
• Optimize Data Lake performance via partitioning, deduplication, compression (Parquet/ORC), and effective indexing strategies (see the first sketch after this list).
• Select, integrate, and maintain tools such as Apache Hadoop, Spark, Hive, HBase, and custom ETL pipelines suitable for offline deployment.
• Build custom ETL workflows for bulk and incremental data ingestion using Python, Spark, and shell scripting (see the incremental-ingestion sketch after this list).
• Implement data governance policies covering access control, retention periods, and archival procedures with security and compliance in mind (a retention sketch follows this list).
• Establish and test backup, failover, and disaster recovery protocols specifically designed for offline environments.
• Document architecture designs, optimization routines, job schedules, and standard operating procedures (SOPs) for platform maintenance.
• Conduct root cause analysis for hardware failures, system outages, or data integrity issues.
• Drive system scalability planning for future multi-fab or multi-site expansions.
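
The optimization bullet above is easiest to picture in code. Below is a minimal PySpark sketch of a deduplicated, partitioned, compressed Parquet write; the table, column, and path names are illustrative assumptions, not part of this posting.

    # Minimal sketch only; source/target paths and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("optimize-layout").getOrCreate()

    raw = spark.read.json("hdfs:///datalake/raw/tool_metrics/")  # hypothetical raw landing zone

    (raw.dropDuplicates(["tool_id", "event_ts"])     # deduplicate on a natural key
        .repartition("event_date")                   # co-locate rows by partition key
        .write
        .mode("overwrite")
        .partitionBy("event_date", "tool_id")        # directory-level partition pruning
        .option("compression", "snappy")             # columnar compression inside Parquet
        .parquet("hdfs:///datalake/curated/tool_metrics/"))

Partitioning by date and tool lets queries prune scans to the relevant directories, while Snappy-compressed Parquet trades a little CPU for a much smaller on-disk footprint.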
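Likewise, a minimal sketch of the incremental-ingestion side of the ETL bullet, assuming a hypothetical file-based watermark checkpoint; every path and column named here is an assumption.

    # Watermark-based incremental load: pick up only rows newer than the last run.
    import json
    from pathlib import Path
    from pyspark.sql import SparkSession, functions as F

    CHECKPOINT = Path("/var/lib/etl/tool_metrics.ckpt")  # hypothetical checkpoint file

    spark = SparkSession.builder.appName("incremental-ingest").getOrCreate()

    last_ts = (json.loads(CHECKPOINT.read_text())["max_ts"]
               if CHECKPOINT.exists() else "1970-01-01 00:00:00")

    src = spark.read.parquet("hdfs:///staging/tool_metrics/")   # hypothetical staging area
    delta = src.where(F.col("event_ts") > F.lit(last_ts))       # rows newer than the watermark

    delta.write.mode("append").partitionBy("event_date").parquet(
        "hdfs:///datalake/curated/tool_metrics/")

    new_max = delta.agg(F.max("event_ts")).first()[0]           # advance the watermark
    if new_max is not None:
        CHECKPOINT.write_text(json.dumps({"max_ts": str(new_max)}))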
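And for the governance bullet, a small sketch of retention enforcement, assuming partitions are laid out as event_date=YYYY-MM-DD directories and managed through the standard hdfs dfs CLI; the paths and the 365-day window are assumptions.

    # Move expired date partitions to an archive area rather than deleting outright.
    import subprocess
    from datetime import date, timedelta

    RETENTION_DAYS = 365  # assumed retention window
    cutoff = date.today() - timedelta(days=RETENTION_DAYS)

    # List partition directories with the HDFS CLI (-C prints paths only).
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-C", "/datalake/curated/tool_metrics/"],
        capture_output=True, text=True, check=True).stdout

    for path in out.splitlines():
        parts = path.rsplit("event_date=", 1)
        if len(parts) == 2 and date.fromisoformat(parts[1]) < cutoff:
            subprocess.run(
                ["hdfs", "dfs", "-mv", path, "/datalake/archive/tool_metrics/"],
                check=True)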
Essential Attributes (Tech Stack) -
• Hands-on experience designing and maintaining offline or air-gapped Data Lake environments.
• Deep understanding of Hadoop ecosystem tools: HDFS, Hive, MapReduce, HBase, YARN, ZooKeeper, and Spark.
• Expertise in custom ETL design and in large-scale batch and streaming data ingestion.
• Strong scripting and automation capabilities using Bash and Python.
• Familiarity with columnar file formats and their compression options (ORC, Parquet) and with ingestion frameworks (e.g., Apache Flume).
• Working knowledge of message queues such as Kafka or RabbitMQ, with a focus on integration logic (see the consumer sketch after this list).
• Proven experience in system performance tuning, storage efficiency, and resource optimization.
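
As a concrete illustration of the message-queue bullet above, here is a minimal consumer sketch using the kafka-python package (the package choice, topic, broker, and group names are all assumptions; the posting only names Kafka/RabbitMQ).

    # Consume records and acknowledge them only after they are persisted.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "tool-metrics",                          # hypothetical topic
        bootstrap_servers=["broker1:9092"],      # on-prem brokers, reachable offline
        group_id="datalake-ingest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        enable_auto_commit=False,                # commit offsets manually
    )

    for msg in consumer:
        record = msg.value
        # ... append `record` to a staging/landing zone for the batch ETL above ...
        consumer.commit()                        # acknowledge once the record is persisted

Disabling auto-commit and committing after the write is what keeps the ingestion at-least-once: a crash before the commit replays the record instead of losing it.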
Qualifications -
• BE/ME in Computer Science, Machine Learning, Electronics Engineering, Applied Mathematics, or Statistics.
Desired Experience Level -
• 4 years of relevant experience after a Bachelor's degree
• 2 years of relevant experience after a Master's degree
• Experience in the semiconductor industry is a plus