This chat focuses on your past experience, so there’s not much you need to do to prepare. However, it would be advisable to pull your thoughts together ahead of time, and here’s the structure you can expect:

  • Experience - covering the type of roles you’ve held, the breadth and depth of responsibility, and size of projects managed. This is also your opportunity to showcase your content expertise in your field of work
  • Technical breadth and depth - At FB, we emphasize collaboration across multiple teams and individuals, so be able to talk about how your work has spanned multiple teams and/or iterations
  • TL (project management) skills - including technical mentoring – think about your role in the setup, execution and delivery of a project
  • People management skills - including upward influencing, mentorship, empathy, people growth etc.
  • Agility - Indicator of your growth opportunities, as measured through your capability and willingness to absorb knowledge from your experience, and use these lessons to become even more effective in new and different situations
  • Motivation - What would you be interested in working on at Facebook, or why are you interested in working to make Facebook better? Do you exhibit a strong general drive and desire to progress as a leader?
  • Meta/Barlas things: IR, DPR, Llama 2, GenAI, LoRA

  • Also, don’t be afraid to be vulnerable and talk about difficult subjects. Show senior-level leadership qualities, as this is a senior role.

Interview questions from earlier

  • How to design LLM end to end, query -> song return
  • Read Eugene Yan LLM
  • Gen AI text diffusion
  • prompt, what metadata
  • Equation hallucination
  • Fine tuning Bert two losses
  • Losses evaluation
  • Mean average precision at k
  • Papers personalization recommender system
  • Amazon llm articles
  • How do you measure diversity
  • do you finetune the LLM or not
  • Hallucination vs CT^2
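
For the "Mean average precision at k" question above, a minimal sketch of the metric may help when explaining it out loud. The function names are my own, and the normalisation (dividing by min(number of relevant items, k)) is one common convention; other write-ups divide by the total number of relevant items instead.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """AP@k for one query: average of precision@i over ranks i <= k holding a relevant item."""
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i  # precision at this rank
    # Assumption: normalise by the best achievable number of hits within the top k.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(
    rankings: List[Sequence[str]], relevants: List[Set[str]], k: int
) -> float:
    """MAP@k: mean of AP@k over all queries."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

Unlike plain precision@k, this rewards putting relevant songs near the top of the list, which is why it tends to come up for ranking and recommendation questions.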


Amazon Music

  • Currently, at Amazon, I lead the Music team for query understanding and personalization. For query understanding, we map requests to standard functions (e.g., play music): we started from a trained base LLM and fine-tuned it on API data with human annotations, so the input is the request text and the output is the API call.
    • So as a user interacts with an Alexa device, say by saying “play music”, we need to understand the request and personalize it with the customer’s details.
  • Read LLM + Recsys

Finetuning LLM process

  1. Data Collection:
    • Gather labeled data: Assemble a dataset where each input represents a user’s music-related query, and the corresponding output is the appropriate API function, such as playMusic() or playLatestMusic().
    • Data augmentation: Increase the variety of your dataset by rephrasing music requests, introducing typical misspellings, or using diverse ways to request music.
  2. Preprocessing:
    • Tokenize: Break down music queries into tokens, which can be words or subwords.
    • Contextualize: Take into account prior context. This might include details like the last song the user listened to or their current mood.
    • Use NER: Extract specific entities like song titles, artist names, or genres from the user’s query using Named Entity Recognition. This will help in better understanding and categorizing user requests.
  3. Fine-tuning:
    • Set up the model: Start with an LLM that has pretrained weights.
    • Define a task-specific head: For this job, you’d probably want a classification layer where each class matches a different API function.
    • Train: Use your music dataset to adapt the model. Adjust settings, like learning rates and batch sizes, when needed.
  4. Evaluation:
    • Validation: Throughout training, check how well the model is doing using a separate validation set.
    • Testing: After the fine-tuning is done, evaluate how well the model understands music-related queries using a distinct test set.
  5. Deployment:
    • Once you’re sure the model is reliable, add it to your system. Now, it will figure out user’s music wishes and trigger the right API calls, like playMusic() or playLatestMusic().
  6. Feedback Loop:
    • Regularly get feedback on the model’s real-world performance in interpreting music requests.
    • Update the model using new data from time to time. This keeps its performance high and helps it stay in tune with changing music tastes or user behaviors.
  • Important Points to Remember:
  • Compute and Storage Costs: Think about the amount of computer power and storage you’ll need, both for training and for using the LLM.
  • Ethical Matters: Make sure your data respects privacy rules. And aim to reduce any biases in the model, even those related to music.
  • Versioning: When you make updates, keep track of model versions. This way, you can go back to an older one if a new version causes problems.
  • With an LLM that’s been fine-tuned this way, users can tell the system about their music choices in a more natural way. In turn, the system can figure out what they mean and play songs or offer a music experience that fits them just right.
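
The data collection and augmentation steps above can be sketched as follows. This is only an illustration: the templates, the `misspell` helper, and the artist list are hypothetical, and a real dataset would come from annotated user queries rather than templates.

```python
import random
from typing import List, Tuple

# Hypothetical templates per API function; "{artist}" is filled from a seed list.
TEMPLATES = {
    "playMusic": [
        "play songs by {artist}",
        "put on some {artist}",
        "i want to hear {artist}",
    ],
    "playLatestMusic": [
        "play the latest music by {artist}",
        "play {artist}'s newest songs",
    ],
}


def misspell(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters to imitate a typical typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def build_dataset(artists: List[str], seed: int = 0) -> List[Tuple[str, str]]:
    """Return (query, api_function) pairs: each clean query plus one noisy variant."""
    rng = random.Random(seed)
    rows: List[Tuple[str, str]] = []
    for label, templates in TEMPLATES.items():
        for template in templates:
            for artist in artists:
                query = template.format(artist=artist)
                rows.append((query, label))
                rows.append((misspell(query, rng), label))  # augmented copy
    return rows
```

Calling `build_dataset(["Adele", "Drake"])` yields 20 labeled rows; the same (query, label) shape is what a classification head over the API functions would be trained on.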

Finer details

In building a music recommendation and playback system:

  1. Entity Recognition: The system identifies key details like song names, artist names, and genre to decide the appropriate playlist or station, ensuring a range of songs rather than just one.

  2. Intent Classification: It determines the user’s request type, e.g., general music playback like “play songs by Adele” versus specific requests such as “play Adele’s latest music.”

  3. Context Understanding: Factors such as user’s location, time, holidays, and content preferences (like explicit content) are considered.

  4. Process Overview:

    • Intent Recognition: Determines the primary user action, like “play music.”
    • Slot Filling: Extracts details like song (“Hello”), artist (“Adele”), playback device (“living room speaker”), and volume (“60%”).
    • Argument Building: Uses extracted details to form function arguments, e.g., track="Hello", artist="Adele".
    • Query Resolution: The system matches the intent and details to an API function: playMusic(track="Hello", artist="Adele", device="living room speaker", volume=60).
    • Handling Incomplete Queries: If a query lacks details, the system asks follow-up questions, like clarifying the artist for a song title.
    • Execution: The determined function is triggered, initiating the playback or other actions.
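
To make the process overview concrete, here is a rule-based sketch of intent recognition, slot filling, argument building, and the follow-up for incomplete queries. It uses a regular expression as a stand-in for the fine-tuned LLM, and the pattern, the `resolve` function, and the follow-up wording are all hypothetical. It will misparse titles that themselves contain " by " or " on ".

```python
import re
from typing import Dict

# Stand-in for the model: one pattern covering "play <track> [by <artist>] [on <device>] [at <n>%]".
PLAY_PATTERN = re.compile(
    r"^play\s+(?P<track>.+?)"
    r"(?:\s+by\s+(?P<artist>.+?))?"
    r"(?:\s+on\s+(?:the\s+)?(?P<device>.+?))?"
    r"(?:\s+at\s+(?P<volume>\d+)\s*%)?$",
    re.IGNORECASE,
)


def resolve(query: str) -> Dict[str, object]:
    """Map a query to an intent, its slots, and either an API call string or a follow-up."""
    match = PLAY_PATTERN.match(query.strip())
    if match is None:  # intent not recognised
        return {"intent": None, "follow_up": "Sorry, what would you like to do?"}
    # Slot filling: keep only the details the user actually gave.
    slots: Dict[str, object] = {k: v for k, v in match.groupdict().items() if v is not None}
    if "volume" in slots:
        slots["volume"] = int(str(slots["volume"]))
    # Handling incomplete queries: ask for the artist when only a title was given.
    if "artist" not in slots:
        question = f"Which artist performs {slots['track']}?"
        return {"intent": "playMusic", "slots": slots, "follow_up": question}
    # Argument building and query resolution.
    args = ", ".join(f"{name}={value!r}" for name, value in slots.items())
    return {"intent": "playMusic", "slots": slots, "call": f"playMusic({args})"}
```

For example, `resolve("play Hello by Adele on the living room speaker at 60%")` fills all four slots, while `resolve("play Hello")` returns a follow-up question instead of a call.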


Evaluating the fine-tuned LLM for music intents requires a comprehensive approach that ensures not only its technical performance but also its usability and relevance to users. Here’s a structured plan:

  1. Quantitative Metrics:
    • Accuracy: Calculate the percentage of user queries that the model classifies correctly into the intended API functions like playMusic() or playLatestMusic().
    • Precision, Recall, and F1-score: Especially important if there’s a class imbalance in the API functions. For instance, if users more frequently request to play music than to play the latest music.
    • Confusion Matrix: Understand which categories or intents are commonly misinterpreted.
  2. Qualitative Analysis:
    • User Testing: Engage a diverse group of users to interact with the model in a real-world setting. Gather feedback regarding its accuracy, relevance of music choices, and overall user satisfaction.
    • Error Analysis: Manually review a subset of misclassifications to identify common themes or patterns. This might reveal, for instance, that the model struggles with recognizing certain genres or artists.
  3. Real-world Performance Metrics:
    • Engagement Metrics: Monitor how often users engage with the music played. A decrease in skips or an increase in full song plays can be indicators of good recommendations.
    • Retention Rate: Measure how often users return to use the recommendation feature. A higher return rate can indicate user satisfaction.
    • Feedback Collection: Allow users to provide feedback directly (e.g., “this wasn’t what I was looking for”) and use this feedback to iteratively improve the model.
  4. NER Evaluation:
    • Entity Recognition Accuracy: Since NER is used in preprocessing, measure how accurately the model identifies and categorizes entities like song titles, artist names, or genres.
    • Coverage: Determine the range of entities the model can recognize. It should ideally recognize a wide array of songs, artists, and genres without significant gaps.
  5. Usability Testing:
    • Intuitiveness: Gauge how easily users can formulate queries and if the system’s responses align with their expectations.
    • Response Time: Since it’s a real-time recommendation system, the model’s response time should be quick to ensure a seamless user experience.
  6. A/B Testing (if possible):
    • Comparison with Baseline: Compare the LLM’s performance against a baseline system (perhaps the current system in use or a simpler recommendation model). By randomly assigning users to interact with either system, you can measure differences in user engagement and satisfaction.
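
The quantitative metrics in item 1 can be computed in a few lines. This is a plain-Python sketch with my own function names; in practice a library such as scikit-learn would normally be used.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of queries mapped to the intended API function."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[Tuple[str, str], int]:
    """Confusion matrix as a sparse mapping of (true, predicted) to a count."""
    return dict(Counter(zip(y_true, y_pred)))


def precision_recall_f1(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Per-class precision, recall and F1, useful when intents are imbalanced."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With four playMusic and two playLatestMusic queries, a model can reach 67% accuracy while scoring only 0.5 F1 on the rarer intent, which is the class-imbalance point made above.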
  • In essence, using the LLM, you’re dynamically translating natural language instructions into structured function calls that the system can understand and act upon. This approach makes interactions intuitive for users while ensuring precise actions on the backend.
  • It’s about gauging both the implicit and explicit needs and delivering a seamless music experience.
  • Our team’s focus is around customer growth so we serve recommendations that will help grow our customer base
    • This includes Next Best Action via a multi-armed bandit, where we look to educate inactive users by giving them 3 personalized push notifications, prompting them to perform different actions on the app.
      • The number 3 was decided after several experiments: we didn’t want to bombard the user but still wanted to educate them.
    • We also have a partnership with Amazon.com retail, where we find correlations between retail products and music latent factors and surface them as item-to-item recommendations on the Amazon.com page.
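
For talking through the Next Best Action idea, a generic multi-armed bandit sketch is below. It uses epsilon-greedy, which is my assumption for illustration; the notes do not say which bandit algorithm the team uses. The action names and click rates are made up.

```python
import random
from typing import Dict, List


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit: mostly exploit the best-looking action, sometimes explore."""

    def __init__(self, arms: List[str], epsilon: float = 0.1, seed: int = 0) -> None:
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts: Dict[str, int] = {arm: 0 for arm in self.arms}
        self.values: Dict[str, float] = {arm: 0.0 for arm in self.arms}

    def select(self) -> str:
        if self.rng.random() < self.epsilon:  # explore
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])  # exploit

    def update(self, arm: str, reward: float) -> None:
        """Incremental mean of the observed rewards for this arm."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def top_actions(self, n: int = 3) -> List[str]:
        """The n actions with the highest estimated reward, e.g. 3 notifications."""
        return sorted(self.arms, key=lambda arm: self.values[arm], reverse=True)[:n]


def simulate(true_rates: Dict[str, float], rounds: int = 5000, seed: int = 0) -> EpsilonGreedyBandit:
    """Simulated users: each notification converts with its (hypothetical) true rate."""
    bandit = EpsilonGreedyBandit(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select()
        reward = 1.0 if env.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit
```

A personalized version would condition on user features (a contextual bandit); this sketch only shows the explore/exploit loop.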


  • Spinoff from Oracle in the healthcare domain, automating administrative and operational tasks
  1. Creating a Clinical Documentation Tool:
    • Named Entity Recognition (NER): To identify specific entities in the text, such as patient names, medication names, diseases, procedures, dates, and other relevant medical terms.
    • Information Extraction: Beyond just recognizing entities, this task involves extracting relationships and attributes associated with these entities. For instance, understanding that a specific drug was prescribed for a particular symptom or disease.
    • Text Classification: To categorize different parts of a clinical note (e.g., diagnosis section, treatment section, patient history).
    • Topic Modeling: To automatically identify the main topics covered in a clinical note, aiding in quick summarization.
  2. Designing an Information Retrieval System: –> FAISS
    • Document Indexing: Efficiently indexing medical guidelines, patient data, and treatment options for rapid retrieval.
    • Query Understanding: Interpreting what a user (possibly a healthcare professional) is looking for, even if their query is in natural, conversational language.
    • Document Ranking: Sorting the retrieved documents by relevance based on the user’s query and possibly other factors like patient specifics.
    • Semantic Search: Using embeddings and other advanced techniques to ensure the retrieval system understands the meaning and context, not just keyword matches.
  3. Automating Claims Processing:
    • Named Entity Recognition (NER): As mentioned earlier, this would be used to identify specific entities like patient names, diseases, treatments, amounts, dates, etc.
    • Text Classification: To categorize different sections of the claim form or to determine if a particular document is, in fact, a claim.
    • Relationship Extraction: To understand the relationships between entities. For instance, connecting a diagnosis with a specific treatment or procedure.
    • Automated Form Filling: Once relevant information is extracted, populating standardized forms or databases using the extracted data.
    • Error Detection: Using NLP to spot inconsistencies or errors in claims, ensuring higher accuracy.
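
The indexing, ranking, and semantic search steps in item 2 can be sketched with brute-force cosine similarity. This is a stand-in only: the bag-of-words `embed` function replaces a learned embedding model, the exhaustive scan replaces a FAISS index, and the documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, docs: Dict[str, str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank every document by similarity to the query and return the top_k."""
    query_vec = embed(query)
    scored = [(doc_id, cosine(query_vec, embed(text))) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

An approximate nearest-neighbour index such as FAISS does the same ranking job without scanning every document, which matters once the corpus is large.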


  1. Modeling Server Capacity Data to Predict Outages:
    • ML Techniques:
      • Time Series Analysis & Forecasting: Methods like ARIMA, Prophet, or LSTM (Long Short-Term Memory networks) to predict server capacity based on historical data.
      • Regression Models: For predicting capacity, techniques like Linear Regression or Support Vector Regression might be relevant.
      • Random Forest & Gradient Boosting: Ensemble methods that can predict server outages based on a multitude of factors and historical data.
  2. Predicting Server Health Using LogBERT to Understand Anomalies:
    • NLP Techniques:
      • Transfer Learning: Using a pre-trained model like BERT (in this case, a variant called LogBERT) and fine-tuning it to analyze server logs.
      • Semantic Embeddings: Representing server logs as vectors in a high-dimensional space using embeddings derived from models like BERT.
    • ML Techniques:
      • Anomaly Detection: Techniques such as One-Class SVM, Isolation Forest, or Autoencoders can be employed to detect anomalies in the log embeddings.
      • Clustering: Using unsupervised algorithms like K-Means or DBSCAN to cluster similar logs and identify potential anomalous patterns.
  3. Outlier Detection for Current Latency and Storage Models:
    • ML Techniques:
      • Statistical Methods: Techniques like the Z-Score, Box-Plot, or IQR (Interquartile Range) for basic outlier detection.
      • Isolation Forest: A tree-based method designed specifically for anomaly and outlier detection.
      • Density-Based Spatial Clustering (DBSCAN): Useful for detecting clusters in data and identifying points that do not belong to any cluster as potential outliers.
      • Autoencoders: Neural network-based approach where the network is trained to reproduce the input data, but anomalies produce higher reconstruction errors.
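
The statistical methods in item 3 are short enough to sketch directly. The function names, thresholds, and latency values are my own illustrative choices.

```python
import statistics
from typing import List, Sequence


def zscore_outliers(values: Sequence[float], threshold: float = 3.0) -> List[float]:
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the box-plot whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

One thing worth mentioning in an interview: on a small sample such as `[100, 102, 98, 101, 99, 103, 97, 100, 500]`, the spike inflates the standard deviation so much that the z-score rule at threshold 3 misses it, while the IQR rule flags 500. That robustness is a reason to prefer IQR or Isolation Forest for latency data.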


  • I am a research fellow at the University of South Carolina, where I collaborate on a few publications. I focus mostly on NLP, with a little vision and multimodality.

    CT2: AI-Generated Text Detection and Introduction of AI Detectability Index (ADI)

  • Overview:
  • Research Focus: The paper emphasizes the importance of detecting AI-Generated Text and introduces the AI Detectability Index (ADI).
  • Achievement: This paper received the Best Paper Award for its innovative approach.

  • Background on ADI:
  • Definition: ADI is a composite metric that merges two linguistic measures: perplexity (syntactic) and burstiness (lexical).
  • Empirical Basis: The composition of ADI is founded on empirical observations, and its formulation was influenced by the density function according to Le Cam’s Lemma.
  • Reflection & Future Work: The authors self-reflect, suggesting potential alternative features for ADI, and indicate opportunities for future research to expand and refine the ADI definition.

  • Evaluation of Current AGTD Techniques:
    1. Overview: Various methods have recently been introduced to detect AI-generated text. However, the paper argues that most of these techniques are not robust against state-of-the-art models.
    2. Watermarking: Originally proposed to label AI text by switching high-entropy words, this technique is shown to be vulnerable to strategies like word replacements or paraphrasing.
    3. Perplexity & Burstiness Estimation: These techniques aim to identify statistical differences between AI and human-produced text. However, newer models, such as GPT-3, generate text so similar to humans that these methods become less effective.
    4. Negative Log-likelihood Curvature: This was introduced to identify AI text based on how perturbations influence probability. Yet, empirical evidence from the paper suggests it doesn’t offer a reliable indicator, especially for models like GPT-3.
    5. Stylometric Analysis: This method, aiming to discern linguistic style differences, is found to be constrained when applied to modern advanced models.
  • Deep Dive into the AI Detectability Index (ADI):
    1. Primary Objective: The ADI was crafted to quantify the detectability of text generated by various AI models in comparison to human writing.
    2. Components & Calculation:
    • Perplexity: Measures the predictability of word sequences in text. Typically, human text exhibits higher perplexity than AI-generated text.
    • Burstiness: Evaluates the occurrence of repetitive clusters of similar words or phrases. AI-generated text tends to have more bursts, while human text shows more variation.
    • Formula: \(ADI_x = \frac{100}{U \times 2} \times \left[\sum_{x=1}^{U} \left\{\delta_1(x) \times (P_t - \mu_{plx}^{H})\right\} + \left\{\delta_2(x) \times (B_t - \mu_{brsty}^{H})\right\}\right]\) Here, \(U\) represents the number of text samples, and higher ADI values suggest that the AI model’s text is more easily detectable.
      1. Damping Factors:
    • Purpose: \(\delta_1\) and \(\delta_2\) are the weighting factors used in the ADI formula to rank language models based on their detectability.
    • Breakdown:
      • \(\delta_1\) is the weighting factor for the perplexity difference \((P_t - \mu_{plx}^{H})\).
      • \(\delta_2\) is for the burstiness difference \((B_t - \mu_{brsty}^{H})\).
    • Scaling and Contribution: These factors amplify the contribution of the differences to the total ADI score. The values for \(\delta_1\) and \(\delta_2\) are derived from the mean and standard deviation of the initial unweighted ADI scores across all language models.
  • The research paper, through its comprehensive evaluation and the introduction of the ADI, seeks to provide the AI community and policymakers a reliable metric to assess AI-generated text detectability.
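
A small sketch of the ADI formula as it is written in these notes, for practising the explanation. Several things here are my assumptions: that the sum runs over both weighted terms, that the perplexity and burstiness values are given per text sample, and that the damping factors are constants. The published definition may differ in detail, so this should not be read as the paper's implementation.

```python
from typing import Sequence


def adi(
    perplexities: Sequence[float],
    burstiness: Sequence[float],
    mu_plx_human: float,
    mu_brsty_human: float,
    delta1: float = 1.0,
    delta2: float = 1.0,
) -> float:
    """ADI per the notes: 100 / (2U) times the sum of weighted differences from human means."""
    if len(perplexities) != len(burstiness) or not perplexities:
        raise ValueError("need the same non-zero number of perplexity and burstiness values")
    u = len(perplexities)
    total = sum(
        delta1 * (p - mu_plx_human) + delta2 * (b - mu_brsty_human)
        for p, b in zip(perplexities, burstiness)
    )
    return 100.0 / (u * 2) * total
```

With two samples, perplexities 12 and 14 against a human mean of 10, and burstiness 0.5 and 0.7 against a human mean of 0.4, the unweighted score is 100/4 × 6.4 = 160.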

  • Watermarking:
  • Involves subtly altering the text to encode an imperceptible signal.
  • Paper shows watermarks can be removed by replacing high-entropy words or paraphrasing.

  • Perplexity estimation:
  • Assumes AI text has lower perplexity (more predictable).
  • Paper demonstrates new large models like GPT-3+ have perplexity close to human text.

  • Burstiness estimation:
  • Hypothesis is AI text exhibits more repetitive word bursts.
  • Fails for recent models which show wide vocabulary usage.

  • Negative log-likelihood:
  • Detects AI text based on how perturbations affect probability.
  • Paper shows this does not provide a consistent signal, even with multiple perturbations.

  • Stylometrics:
  • Analyzes linguistic style features to identify AI text.
  • Limited success differentiating advanced neural network models.

  • Classification models:
  • Train classifiers to detect AI vs human text.
  • Performance is often weak, depends heavily on training data.
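
Since perplexity underlies several of the detection methods above, here is the standard definition as code: the exponential of the negative mean token log-probability. The input is assumed to be per-token natural-log probabilities from some scoring language model; obtaining them is outside this sketch.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-(1/N) * sum(log p_i)): lower means the text was more predictable to the model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A text where every token had probability 0.5 scores 2, and one where every token had probability 0.25 scores 4; the detection hypothesis is that AI text tends toward the lower end.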

  • the paper does empirically demonstrate that perplexity and burstiness estimation on their own fail to reliably detect AI-generated text, especially with recent advanced models like GPT-3.
  • However, the authors still propose using perplexity and burstiness as the foundation for their new metric ADI.
  • There are a couple reasons why:

1) While perplexity and burstiness may not work well in isolation, the authors believe combining them could provide a more robust signal. The ADI formula incorporates both features.

2) The ADI introduces additional factors like using human benchmark comparisons, weighting differences by model detectability, and averaging over many samples. So it enhances perplexity and burstiness in a more comprehensive metric.

3) The authors argue that other detection methods like stylometrics and classifications are ultimately dependent on core features like perplexity and burstiness. So distilling it down to these fundamental elements could offer a universal benchmark.

4) As models evolve, other detection methods may fail, but perplexity and burstiness can still indicate how close models get to mimicking human writing patterns.

  • In essence, the authors are proposing a new way to leverage perplexity and burstiness as part of a more robust and adaptable detection metric in ADI. It is a fair criticism, though, that they still rely on features whose shortcomings they demonstrated. More research is needed to validate the effectiveness of ADI as models continue to advance.

CONFLATOR: Code Mixing:

    1. Switching-Point Based Rotary Positional Encoding:
    • The authors introduce a new way to handle positional encoding in neural networks. Positional encoding is a technique used in Transformer architectures (a popular neural network model) to understand the position or order of words in a sentence.
    • The new technique revolves around the idea of “switching points.” Whenever a switch from one language to another occurs in a code-mixed sentence, they change the rotation (or tweak the positional encoding). This helps the model learn when and how languages are mixed within a sentence.
    • CONFLATOR is a new neural network model designed specifically for languages that are code-mixed, like Hinglish.
    • The primary innovation in CONFLATOR is its use of the aforementioned switching-point based rotary positional encoding. Initially, the model looks at each word individually to determine if a switch has occurred. Then, it examines pairs of words (bigrams) to refine its understanding.
  2. Empirical Evidence:
    • The authors claim to have evidence that CONFLATOR successfully learns the patterns of how languages are mixed together in Hinglish. They compare its performance to other models that use different methods to understand the order of words. Their findings suggest that CONFLATOR does a better job at this than other models, as depicted in Figure 5 of the paper.
  • In a nutshell, this paper is about introducing a new technique and model for understanding and processing sentences where two languages are mixed together, with a specific focus on the mix of Hindi and English known as “Hinglish.”
  • Textual Diffusion with Hallucination
    • Where we’re looking to incorporate factual ground truth during the denoising process to see if that can help mitigate hallucination.
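
To have something concrete for the switching-point idea in the CONFLATOR notes above, here is a toy sketch. It detects switching points from per-token language tags and applies a 2D rotary rotation whose direction flips at each switch. The direction flip is my illustrative reading of "change the rotation"; the paper's exact rule is not in these notes, and the function names and `theta` value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def switching_points(lang_tags: Sequence[str]) -> List[bool]:
    """True at every token whose language differs from the previous token's."""
    return [i > 0 and lang_tags[i] != lang_tags[i - 1] for i in range(len(lang_tags))]


def rotate(pair: Vec2, angle: float) -> Vec2:
    """Standard 2D rotation, the building block of rotary positional encoding."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def switching_rope(vectors: Sequence[Vec2], lang_tags: Sequence[str], theta: float = 0.1) -> List[Vec2]:
    """Rotate each token's vector by position * theta, flipping direction at switches."""
    direction = 1.0
    encoded: List[Vec2] = []
    for pos, (vec, is_switch) in enumerate(zip(vectors, switching_points(lang_tags))):
        if is_switch:
            direction = -direction  # ASSUMPTION: "changing the rotation" modelled as a flip
        encoded.append(rotate(vec, direction * pos * theta))
    return encoded
```

For tags `["hi", "hi", "en", "en", "hi"]` the switches fall on the third and fifth tokens, so the model sees a different rotation pattern exactly where the language changes.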

Projects mentioned LoRA


  1. Few-Shot Learning with Pre-trained Language Models:

    • Performed few-shot learning with pre-trained LLM: This means that a small amount of data was used to adapt (“fine-tune”) pre-existing language models (likely designed for broad tasks) to perform a more specific task. The fact that the models are pre-trained indicates that they already have a good grasp of the language due to previous training on large datasets.

    • such as GPT and BERT from HuggingFace’s libraries: The pre-trained models used were GPT and BERT, which are prominent models for understanding language context. These models were sourced from HuggingFace, a leading provider of state-of-the-art language models.

    • Experimented with more sophisticated fine-tuning methods such as LoRA: After starting with basic fine-tuning, more advanced methods were employed. LoRA (Low-Rank Adaptation) is one such method: it freezes the pre-trained weights and trains small low-rank update matrices, which makes adapting a large model to a new task much cheaper.

    • Used PyTorch framework: All the experiments and model training were done using PyTorch, which is a popular deep learning framework. This gives information about the tools and libraries that might have been employed during the procedure.

  2. Multitask Training for Recommender Systems:

    • Implemented a multi-task movie recommender system: A recommender system was developed that can handle multiple tasks simultaneously. In this context, it might mean recommending various types of content or handling different aspects of recommendations concurrently.

    • based on the classic Matrix Factorization and Neural Collaborative Filtering algorithms: The foundational techniques used for this recommender system are:

      • Matrix Factorization: It’s a technique where user-item interactions are represented as a matrix, and then this matrix is decomposed into multiple matrices representing latent factors. This is a traditional technique often used in recommendation systems.
      • Neural Collaborative Filtering: This is a more modern technique that uses neural networks to predict user-item interactions, thus providing recommendations.

In summary, the first task involved adapting large, general-purpose language models for specific tasks using a small amount of data, while the second task was about building a multi-task recommendation system using traditional and neural techniques.
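
Since LoRA is on the list of topics, a dependency-free sketch of the idea may help: the pre-trained weight matrix W stays frozen and only two small matrices A and B are trained, so the effective weight is W + (alpha / r) · B · A. The class below is a toy in plain Python lists, not the PyTorch or Hugging Face PEFT implementation, and the class and attribute names are my own.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer y = W x + (alpha / r) * B (A x) with W frozen."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weights, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so training begins from W unchanged.
        self.a: Matrix = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b: Matrix = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [h + self.scale * u for h, u in zip(base, update)]

    def trainable_params(self) -> int:
        """r * (d_in + d_out), versus d_in * d_out for full fine-tuning."""
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)
```

For a 4×4 layer at rank 1 this trains 8 numbers instead of 16; at real model sizes the saving is orders of magnitude, which is the main selling point.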

Technical Breadth

  • I love collaboration and thinking outside the box. On Amazon devices, the primary goal was for users to shop.
  • So what I’ve been trying to do is, find correlation between retail items and songs, both for the website and Alexa as well.
  • Item to item recommendations are bread and butter of Amazon

People management

  • I like to lead with empathy
  • Mentorship: make sure everyone has a mentor, helping them find one if not
  • people growth
  • upward influencing: offering solutions, understanding the perspectives and goals



  • The research coming out of Meta is an inspiration in itself; Meta is a trailblazer in so many domains:
  • Text to speech: Voicebox, which is able to do speech generation tasks it was not necessarily trained on
  • Pure NLP: No language left behind project with translations between 200 languages and including the work with low-resource languages is something I really connect with.
  • Recommender system: embedding based retrieval and so many more
  • And I imagine Smart glasses org to be a culmination of all of this research and so, to be given the opportunity to work there would be a true joy.

Questions for the manager

  • Team structure: I assume since it’s lenses, there’s collaboration with a vision team. Are there other modalities at play?
  • Hallucination is the biggest problem with LLMs
  • Smart Glasses (SG) Language AI
  • We focus on Conversational AI, SG Input AI, Knowledge-enriched Discovery AI, Privacy ML and AI Efficiency. Our system powers critical SG launches thrusted by the Meta leadership. We have strong scientists and engineers, solving challenging AI problems with both Cloud based large models and On-Device ML models. Join us if you are passionate about AI-centric next-gen computing platforms and pushing advanced AI at production scale!

  • Our team: Smart input team’s mission is to enhance input and messaging experience on these smart glasses. Imagine being able to receive your whatsapp messages, and being able to get more context (summary) and respond in a natural way just like how you would have a conversation with a human, all while wearing your glasses and not taking your attention off the things you are doing (like biking, cooking, walking with your grocery bags).
  • Our tech: We want to bring ChatGPT capabilities on-device. We build the capabilities similar to what ChatGPT can do for messaging but with a model that is much smaller to be able to fit on the glasses. This is a niche tech space with big opportunities to innovate on LLMs, on-device ML, privacy ML such as Federated learning, on-device personalization. This team aims to ship these cool technologies to drive end user value.
  • While the rest of the world is going after making LLMs work on the servers, we are taking a bigger challenge to make LLMs work on-device.

  • In a more constrained use case, such as using natural language processing (NLP) to interpret voice commands for Amazon Music, the problem of hallucination might be less prominent. In this case, the system is less likely to “make things up” because it’s not primarily generating content but rather interpreting and executing commands based on user input. If the NLP system doesn’t understand a command, it’s more likely to ask for clarification or fall back to a default action rather than inventing an inappropriate response.
  • However, hallucination could still be an issue if the system uses language models to generate responses or explanations. For example, if you ask Alexa for information about a particular artist or song, and it uses a language model to generate the response, it could potentially hallucinate details that aren’t true.
  • In any case, whether hallucination is a problem depends on the specific application and how the system is designed and used. It’s an active area of research in AI to find ways to mitigate this issue, especially as language models are being used in more diverse and impactful applications. Techniques like fine-tuning the model on specific tasks or data, utilizing structured data sources to ground the responses, or using model validation to check the outputs could help to limit hallucination.



  • Implementing RAG in an AI-driven application involves the following steps:
  1. The user submits a query.
  2. The application scans for pertinent documents that could potentially address the query from a document index, which usually comprises proprietary information.
  3. The application crafts an LLM prompt by merging the user’s query, the identified documents, and directives for the LLM to respond using the given documents.
  4. This constructed prompt is then dispatched to the LLM.
  5. Based on the provided context, the LLM produces a response to the user’s query, which serves as the system’s final output.
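
The five steps above can be sketched end to end. Everything here is a simplified stand-in: retrieval is word overlap rather than an embedding index, the prompt wording is my own, and `llm` is any callable that takes a prompt string and returns text, so no particular model API is assumed.

```python
import re
from typing import Callable, Dict, List


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, index: Dict[str, str], top_k: int = 2) -> List[str]:
    """Step 2: return ids of the documents sharing the most words with the query."""
    query_tokens = tokens(query)
    ranked = sorted(
        index.items(), key=lambda item: len(query_tokens & tokens(item[1])), reverse=True
    )
    return [doc_id for doc_id, text in ranked[:top_k] if query_tokens & tokens(text)]


def build_prompt(query: str, documents: List[str]) -> str:
    """Step 3: merge instructions, retrieved documents, and the user's query."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Answer using only the documents below. If they do not contain the answer, say so.\n"
        f"Documents:\n{context}\n"
        f"Question: {query}"
    )


def answer(query: str, index: Dict[str, str], llm: Callable[[str], str], top_k: int = 2) -> str:
    """Steps 1 to 5: query in, grounded response out."""
    doc_ids = retrieve(query, index, top_k)
    prompt = build_prompt(query, [index[doc_id] for doc_id in doc_ids])
    return llm(prompt)  # steps 4 and 5: send the prompt, return the model's response
```

Grounding the prompt in retrieved documents is also one of the practical answers to the hallucination question: the model is asked to answer from supplied text rather than from memory.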