ML Case-study Interview Question: Real-Time Slack Event Summarization Using Large Language Models
Case-Study Question
You lead a Data Science team at a company that manages real-time events. These events often involve multiple stakeholders communicating in Slack channels. The team wants to build a system that uses Large Language Models to generate summary suggestions based on the real-time conversation and metadata. Design and implement a feature that generates a concise event summary after each update, then propose how you would measure its effectiveness and safeguard against inaccurate or irrelevant suggestions.
Detailed Solution
This solution focuses on building a system that captures Slack channel data in real time, sends it to a Large Language Model, and returns summary suggestions that are both context-aware and efficient.
Data Ingestion and Processing
Set up a pipeline to capture Slack messages, event updates, timestamps, and any relevant metadata. Store this in a secure database. Decide which data to redact or omit to satisfy privacy rules. Make sure you have the legal and compliance approvals for using any external model provider.
Core Prompting Strategy
Craft a prompt that includes:
The current event name or description.
Previous summary text (if it exists).
All relevant messages and updates within a certain time window.
Shape the prompt so the model consistently focuses on key details. Test how small rephrasings change the model’s output. Keep a versioned record of each tested prompt.
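A minimal sketch of how such a prompt might be assembled. The function name and inputs (event_name, previous_summary, messages) are illustrative placeholders, not a prescribed interface:

```python
def build_summary_prompt(event_name: str, previous_summary: str | None, messages: list[str]) -> str:
    """Assemble a summary-suggestion prompt from event metadata and recent Slack messages."""
    previous = previous_summary or "No summary exists yet."
    lines = [
        "You are drafting a concise status summary for an ongoing event.",
        "",
        f"Event: {event_name}",
        f"Previous summary: {previous}",
        "",
        "Recent channel messages:",
    ]
    lines += [f"- {m}" for m in messages]
    lines += [
        "",
        "Write a short, factual summary of the current state.",
        "Only use facts that appear above. Do not invent names, causes, or next steps.",
    ]
    return "\n".join(lines)
```

Keeping the template in one function makes it easy to version and to diff small rephrasings against each other.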
Tooling for Prompt Iteration
Create a script to run different prompt variations against real or synthetic data (representative of real events); a sketch of such a harness follows this list. Automate test runs on edge cases such as:
Minimal data.
Extremely large channels with many messages.
Noisy or irrelevant chatter.
Collect outputs for evaluation.
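One possible shape for that harness, assuming the caller supplies whatever model client is in use as a call_model function; the prompt variants and test cases below are placeholders:

```python
import json
from pathlib import Path
from typing import Callable

# Illustrative variants and cases; real ones would come from versioned prompt files and recorded events.
PROMPT_VARIANTS = {
    "v1_baseline": "Summarize the following event updates:\n{context}",
    "v2_structured": "Summarize the event below. Focus on impact, cause, and next steps.\n{context}",
}

TEST_CASES = {
    "minimal_data": "User A: investigating.",
    "noisy_channel": "\n".join(["lunch anyone?", "DB latency up 40%", "lol", "mitigation deployed"]),
}

def run_prompt_matrix(call_model: Callable[[str], str], output_path: str = "prompt_runs.jsonl") -> None:
    """Run every prompt variant against every test case and log the outputs for later evaluation."""
    with Path(output_path).open("w") as out:
        for variant_name, template in PROMPT_VARIANTS.items():
            for case_name, context in TEST_CASES.items():
                response = call_model(template.format(context=context))
                out.write(json.dumps({
                    "variant": variant_name,
                    "case": case_name,
                    "output": response,
                }) + "\n")
```

The JSONL log gives reviewers a flat record of every variant-by-case output to compare side by side.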
Ensuring Human Oversight
Present summary suggestions via an interactive UI. Offer a direct option to accept, edit, or reject. Track the frequency of each action. For acceptance rate, define:
acceptance_rate = accepted_summaries / (accepted_summaries + edited_summaries + rejected_summaries)
Explanations of the parameters:
accepted_summaries: Count of times a suggestion is accepted without modification.
edited_summaries: Count of times a suggestion is edited then accepted.
rejected_summaries: Count of times a suggestion is discarded entirely.
A high acceptance_rate indicates valuable suggestions. A large ratio of edited_summaries signals partially useful outputs. A high rejection rate might mean the model’s prompt or data context is flawed.
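A direct translation of the formula into code keeps the metric unambiguous:

```python
def acceptance_rate(accepted: int, edited: int, rejected: int) -> float:
    """Share of suggestions accepted without modification, per the formula above."""
    total = accepted + edited + rejected
    return accepted / total if total else 0.0

# Example: 120 accepted, 45 edited, 15 rejected -> 0.67
print(round(acceptance_rate(120, 45, 15), 2))
```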
Implementation Details
Use a modular function that orchestrates the entire workflow:
Check if the customer has agreed to share their data with the LLM provider.
Fetch Slack messages and updates up to a target timestamp.
Build and send the prompt to the model.
Parse the raw model response, storing it alongside metadata.
Return the suggestion to the end user.
Encapsulate the LLM logic in a service layer to let future features reuse the same pipeline with minimal extra code.
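A sketch of that orchestration, with the consent check, Slack fetch, model call, and persistence passed in as hypothetical callables, and reusing the build_summary_prompt helper sketched earlier:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

@dataclass
class SummarySuggestion:
    event_id: str
    text: str
    model_response_raw: str
    created_at: datetime

def suggest_summary(
    event_id: str,
    target_ts: datetime,
    has_llm_consent: Callable[[str], bool],                 # hypothetical consent lookup
    fetch_messages: Callable[[str, datetime], list[str]],   # hypothetical Slack fetch
    call_model: Callable[[str], str],                       # hypothetical LLM client wrapper
    store: Callable[[SummarySuggestion], None],             # hypothetical persistence layer
) -> SummarySuggestion | None:
    """Orchestrate the workflow above: consent check, fetch, prompt, model call, store, return."""
    if not has_llm_consent(event_id):
        return None  # customer has not agreed to share data with the LLM provider

    messages = fetch_messages(event_id, target_ts)
    # Event name lookup omitted for brevity; event_id stands in for it here.
    prompt = build_summary_prompt(event_name=event_id, previous_summary=None, messages=messages)
    raw = call_model(prompt)

    suggestion = SummarySuggestion(
        event_id=event_id,
        text=raw.strip(),
        model_response_raw=raw,
        created_at=datetime.now(timezone.utc),
    )
    store(suggestion)
    return suggestion
```

Because every external dependency is injected, the same function can be exercised in tests with fakes and reused by future features through the service layer.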
Handling Edge Cases
Expect the model to sometimes introduce fictional content or suggest steps outside your real process. Reduce these errors by:
Instructing the model clearly on the format of the output (for example, returning structured JSON that you validate before display, as sketched after this list).
Asking it to only summarize known facts.
Giving explicit constraints about any forbidden outputs or disclaimers.
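For example, a minimal validation step can refuse to surface output that does not match the requested structure. The field names here are assumed for illustration:

```python
import json

REQUIRED_FIELDS = ("impact", "current_status", "next_steps")  # assumed schema, for illustration only

def parse_model_json(raw: str) -> dict | None:
    """Return the parsed summary dict, or None if the model ignored the format instructions."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(field in data for field in REQUIRED_FIELDS):
        return None  # missing fields: discard rather than surface a malformed suggestion
    return data
```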
Ongoing Improvements
Continuously refine prompts based on user feedback. If you detect a pattern of rejections or frequent edits, investigate reworking the prompt. Maintain open communication with stakeholders about uncertain timelines, since AI-based deliverables often require extra iteration time.
Follow-up Questions and In-depth Explanations
1) How do you evaluate the quality of the summaries beyond the acceptance rate?
One way is user surveys. Ask stakeholders whether the summary conveys the right technical and non-technical context. Track the time saved by using automatic suggestions versus writing from scratch. Compare user engagement (for example, fewer Slack pings asking for context) before and after introducing AI-based summaries. Examine whether fewer post-mortem clarifications are needed when well-crafted summaries are provided.
Quantitative approaches include measuring summary length, checking for repeated or redundant phrases, and verifying if crucial data points (such as root cause details or active steps) are present. You can build a lightweight script that parses the model’s output and flags whether it includes critical keywords (like error codes or environment names) that often matter during events.
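A sketch of such a flagging script; the patterns shown are illustrative and would be tuned to the organization's own error codes and environments:

```python
import re

# Assumed patterns for illustration; a real deployment would tune these to its own conventions.
CRITICAL_PATTERNS = {
    "error_code": re.compile(r"\b[A-Z]{2,}-\d{3,}\b"),               # e.g. "DB-5021"
    "environment": re.compile(r"\b(prod|production|staging)\b", re.IGNORECASE),
    "root_cause": re.compile(r"\broot cause\b", re.IGNORECASE),
}

def flag_missing_keywords(summary: str) -> list[str]:
    """Return the names of critical categories that the summary fails to mention."""
    return [name for name, pattern in CRITICAL_PATTERNS.items() if not pattern.search(summary)]
```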
2) What if the model leaks sensitive content?
Prevent leaks by stripping out personal or sensitive tokens before sending data to the model. Any personally identifiable information or high-risk data should be obfuscated. In the final response, keep placeholders for sensitive elements. For example, remove user email addresses or internal URLs. Have an automated filter that scans raw Slack content before sending it to the LLM.
Protect your model output by limiting it to relevant fields. If your logs show that the model returns potential secrets, refine the prompt with stricter instructions about ignoring or removing sensitive info. Consider using a zero-data retention policy from your vendor if possible.
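A minimal redaction filter along these lines; the patterns are examples only, and a production system would likely rely on a dedicated PII-detection service:

```python
import re

# Illustrative rules; the internal-URL pattern is an assumption about how such links look.
REDACTION_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"https?://internal\.[^\s]+"), "[REDACTED_INTERNAL_URL]"),
]

def redact(message: str) -> str:
    """Replace sensitive tokens with placeholders before the text is sent to the LLM."""
    for pattern, placeholder in REDACTION_RULES:
        message = pattern.sub(placeholder, message)
    return message
```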
3) How would you handle time constraints with large volumes of Slack data?
Implement a summarization step that clusters messages and picks the most critical lines before sending them to the model. This preprocessing can reduce your token usage and latency. Also, run asynchronous requests if real-time responses are not mandatory. If near real-time is required, cache partial summaries at intervals and only send incremental updates to the model.
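A simple heuristic version of that preprocessing step, assuming keyword-based relevance scoring and a fixed message cap; a real system might cluster messages semantically instead:

```python
# Assumed signal keywords; these would be tuned to the domain.
SIGNAL_KEYWORDS = ("error", "deploy", "rollback", "latency", "mitigat", "root cause", "customer")

def select_critical_messages(messages: list[str], max_messages: int = 50) -> list[str]:
    """Drop chatter and keep the most recent messages that look operationally relevant."""
    relevant = [m for m in messages if any(k in m.lower() for k in SIGNAL_KEYWORDS)]
    return relevant[-max_messages:]  # bound the count to control token usage and latency
```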
4) Why not auto-update the summary without user confirmation?
Auto-accepting poses a risk. Inaccurate or misleading summaries create confusion and reduce trust in the system. People may disable the feature entirely if they face repeated bad updates. Letting users approve or edit suggestions preserves trust. Over time, if acceptance rates are consistently high, you might experiment with auto-updating for certain well-understood scenarios.
5) How do you adapt prompt engineering if you switch to a different LLM provider?
Keep your prompting logic in a central interface. Then create provider-specific adapters. Each adapter handles the minor differences in token count, prompt structure, or format. Test new prompts the same way you did with the previous provider. Reuse your automated test framework to compare outputs from different endpoints.
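A sketch of that adapter pattern; the provider classes are hypothetical placeholders around whatever SDKs are actually in use:

```python
from typing import Protocol

class LLMAdapter(Protocol):
    """Provider-agnostic interface that the prompting logic depends on."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class ProviderAClient:
    """Hypothetical adapter; wraps provider A's SDK."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        # Call provider A's SDK here and return the completion text.
        raise NotImplementedError

class ProviderBClient:
    """Hypothetical adapter for a second provider with a different prompt format."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        # Reshape the prompt (e.g. into chat-style messages) before calling provider B's SDK.
        raise NotImplementedError

def generate_summary(adapter: LLMAdapter, prompt: str) -> str:
    """Core logic never imports a provider SDK directly; it only sees the adapter interface."""
    return adapter.complete(prompt, max_tokens=512)
```

Swapping providers then means writing one new adapter and re-running the existing prompt test framework against it.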
6) How would you integrate structured outputs into the final summary?
Instruct the model to return JSON. Parse the JSON fields, reformat them if needed, then render them into a user-facing summary. This approach ensures that essential elements like cause, steps taken, or final resolution are always present and in the correct order.
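A short rendering step, reusing the assumed field names from the validation sketch above:

```python
def render_summary(parsed: dict) -> str:
    """Turn the structured fields returned by the model into the user-facing summary text."""
    return (
        f"Impact: {parsed['impact']}\n"
        f"Current status: {parsed['current_status']}\n"
        f"Next steps: {parsed['next_steps']}"
    )
```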
7) Why is it important to keep the rest of the organization updated on AI feature timelines?
AI engineering can be unpredictable. Prompt improvements might take a few hours or spill into days of trial-and-error. Frequent check-ins with product and leadership teams prevent misaligned expectations. Over-commitment without factoring in these unknowns can derail sprints or set unrealistic milestones.
8) How do you iterate quickly but still maintain safety?
Run limited pilots with a small user group. Gather feedback and usage metrics. Roll back or refine quickly if you see major issues. Version your prompts so you can revert if a new one introduces more errors. Keep thorough logs of all model outputs and user interactions. This helps debug mistakes and ensures accountability.
9) How would you extend this approach to other AI-driven features?
Reuse your orchestrator function and data pipeline. For additional features, define a new prompt interface. Feed relevant data to the model and plug the output into a user-facing component. If you have well-tested tooling, you can spin off new AI features with minimal overhead. For example, you could generate suggested follow-up tasks, recommended timeline updates, or potential root-cause guesses by reusing the same pipeline code and adjusting the prompt text.