Generative AI FAQ from Managers and Executives: an IE Data Scientist's Account of Coaching Industry Professionals

(with MIT Professional Education, and Global Alumni)

At Insight Edge, many of our team members are involved in "extra-curricular activities": building our skills outside of our day jobs, and bringing that experience back home to IE.

In a previous Tech Blog post, I shared my experience working in collaboration with NASA and Google Cloud as part of the "Frontier Development Lab" program (see A Data Scientist’s “Sabbatical”: Skill-building through external scientific collaboration).

Other members of IE supervise grad-student research projects, lecture at universities, or even tutor high-school students. Even if those jobs are not connected to IE projects directly, we believe our members benefit from gaining wide experience explaining technology and methods to folks outside of the Sumitomo Corp. network. Learning to teach effectively has a very healthy impact on our ability to engage with stakeholders and explain the trade-offs of different DX implementations to professionals in SC and our many operating companies.

Since September 2023, I have been acting as a "Learning Facilitator" for MIT Professional Education's "Applied Generative AI for Digital Transformation" online program: a kind of continuing-education course helping industry professionals, from magazine writers to pharmaceutical executives, update their skills and curiosity regarding Generative AI. The course offers a mixed bag of GenAI nuts and bolts and best practices, as well as overall frameworks for thinking through the larger business trends GenAI tools may bring about.

As a Learning Facilitator, I host "Office Hours" with groups of 30-60 participants, fielding their Q&A and helping them digest all the different topics the course brings up. While it's the faculty who give the lectures, it's my role to go deeper and adapt to each participant's individual learning needs: via office hours, Q&A, feedback on written assignments, and any other SOS or sanity-check request.

I've really loved this role, as it meshes very well with what I often have to do at Insight Edge / Sumitomo Corp.: help our organization's many professionals resolve misconceptions surrounding advanced technology, explore the complex trade-offs, and navigate the common pitfalls of new algorithm R&D and commercial implementation.

"Can I use ChatGPT for the homework?"

While this might seem a bit trivial, it comes up immediately with every cohort, and usually leads to some profound discussion! Naturally, with a course about LLMs, students will be very eager to use them, or will even guess that they are expected to use LLMs to generate their work. The trap that many fall into is copy-pasting the assignment instructions verbatim into ChatGPT, and then submitting the verbatim response as their own work. Since these are working professionals rather than university students, my major concern is that folks are not obtaining any added value from the tool: neither learning anything from it nor improving the quality of the result.

This can have disastrous consequences if scaled to their actual professional workloads: e.g. workplace memos, reports, or even consumer-facing documents can end up reading extremely generically, and risk resembling content generated by competitors. To a single user, the output often looks very good. But if we sample the output across many similar, generic prompts, the results tend to be very homogeneous.

The general feedback I give folks here is really quite similar to what many proactive university instructors are recommending: use LLM tools in a way that allows you to focus your effort on what you do best. Instead of having them write for you, use them to help you get started, to give structure, or to understand what a generic output would look like, so that you can avoid it. You can ask a chatbot for a list of common pros and cons on an issue, then go through that list one by one and respond with your own takes: a way of coaching yourself or "gamifying" your task. After that, you don't even need to keep the text generated by the LLM. It's sort of like temporary scaffolding.

For more on the balance between empowering oneself and others with LLMs vs the potential risks of outsourcing thought or creativity to them, I've gathered a handful of papers I usually point students to:

Impact of artificial intelligence on human loss in decision making, laziness and safety in education

AI Art and its Impact on Artists

"How could I be prompting this more effectively?"

One of the most common questions folks ask is how to write effective prompts. Actually, that's not true: it's one of the most chronically unasked questions. Many students will cling to the most basic, minimal, or copy-pasted prompts, because they work. Or at least, they appear to work well enough on the surface. And for folks whose primary motivation in using LLMs is simply to save time, they're typically so impressed by the speed at which the LLM produces a pretty-looking e-mail that they never look to improve it. Thus I focus less on answering this question directly, and more on getting students into the habit of asking themselves this question whenever they use an LLM, especially when they are in a hurry, because that's when poor-quality output tends to be overlooked. To help folks get into that habit, I make some very minimal suggestions for them to always follow:

  • Never use generic prompts:
    • In other words, verbatim prompts copied from a prompt "cheat sheet", or any other widely-available source: while the overall structure of a prompt can be re-used (and probably should be, so that you can experiment systematically), copying a prompt exactly runs the risk that the response itself will be generic (and likely to resemble the work of others!)
  • Structure the prompt with keys and values:
    • While it may feel most natural to prompt an LLM in conversational language, you can often obtain better control over the result by applying some more machine-like formatting. For example, when prompting for writing samples, you could choose a set of three keys, such as "task:", "voice:", and "vocabulary level:", each followed by your desired value. Remember that LLMs are trained not only on prose or conversational language, but also on a wide variety of data formats and programming languages. You can thus use some programming-like syntax in your request to remove ambiguity. For students who code, I usually point them to JSON or YAML syntax; you can use such formats as templates for structuring a prompt, whether or not your request has anything to do with programming. For students who are less familiar with data and code, I usually just provide an example of key-values (which typically ends up looking YAML-like), but tell them to keep it simple and experiment (see the sketch after this list). When in doubt, it's always better to have clear structure in the prompt: that way you can easily revisit and tweak different parts of it, to home in on the results you want.
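As a concrete illustration, here is a minimal sketch of that key-value prompt style. The keys and values are hypothetical examples I might suggest in Office Hours, not prescriptions from the course:

```python
# A minimal sketch of a structured, key-value style prompt.
# The keys and values are illustrative; adjust them to your own task.
import textwrap

prompt = textwrap.dedent("""\
    Please write a short piece of text according to the following parameters:
    task: announcement e-mail for an internal product demo
    voice: friendly but concise
    vocabulary level: non-technical, suitable for all departments
    length: about 150 words
    """)

print(prompt)  # paste this into your chatbot of choice, then tweak one key at a time
```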

How can I use JSON to prompt more complex stories and tasks?

Folks interested in prompting using JSON or other structured formats might have a look at the following Reddit thread:

JSON as prompts : r/ChatGPT

The thinking is that this makes the key context components more recognizable to the LLM's trained features, and takes some ambiguity out of the outputs.

You could also try something simple to start with, maybe a quick JSON of 10 or so parameters for a software suite outline. Start the prompt off with something like “What follows is a JSON with parameters defining a desired brainstorming request. Please generate a response based on the parameters in the JSON”, then paste the JSON, and see what you can come up with. Based on that experience, you could also look into more systematic prompt-structuring tools like LangChain.
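Here is a minimal sketch of that idea in Python, assuming you simply want to assemble the prompt text yourself; the parameter names are hypothetical and should be adapted to your own brainstorming request:

```python
# A minimal sketch of the "JSON as prompt" idea described above.
# The parameter names are hypothetical; the point is simply to wrap
# your request in an explicit, machine-readable structure.
import json

params = {
    "request_type": "brainstorming",
    "topic": "features for a small-team project-management suite",
    "audience": "non-technical managers",
    "number_of_ideas": 10,
    "tone": "practical, no marketing fluff",
    "constraints": ["each idea under 30 words", "no duplicate concepts"],
}

prompt = (
    "What follows is a JSON with parameters defining a desired brainstorming request. "
    "Please generate a response based on the parameters in the JSON.\n\n"
    + json.dumps(params, indent=2)
)

print(prompt)  # paste into your chatbot, or send via whatever API client you already use
```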

Does RAG cure hallucinations?

No! RAG is a great tool, but it's not a "set and forget" barrier against hallucinations or falsehoods.

A lot of students are introduced to the idea of "Retrieval Augmented Generation" during this course or others, and are very excited about it. RAG is a very powerful technique for keeping an LLM relatively constrained to a particular set of documents. However, it's not a substitute for quality control: LLMs can still produce hallucinations even when connected to a RAG system. This is because the LLM model at the core of whatever app you develop is not changed at all by the RAG system. Rather, it is being carefully fed context in the form of chunks of your chosen "corpus", or document collection (Tom Yeh's "AI by Hand" series has produced a fantastic set of beginner's guides to AI topics, including a great walkthrough of how RAG works). These context clues then go into the LLM along with your prompt, strongly biasing the response towards that context. The LLM weights themselves are not changed, and we can still see false or misleading information appear in the output. That said, RAG-enabled LLM apps are still going to be much more constrained to your task than a vanilla LLM implementation, so don't let this answer scare you. Just remember that whatever support system you use to filter, constrain, loop, or otherwise augment an LLM-based system, you still have to carefully monitor it for problematic outputs.
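To make the mechanics concrete, here is a toy sketch of the retrieval-plus-context flow, with a deliberately naive word-overlap retriever standing in for the embedding search and vector store a real RAG system would use:

```python
# A toy sketch of the RAG flow: retrieve the most relevant chunks from your
# corpus and prepend them to the prompt. The LLM weights never change; only
# the context it is fed changes. Corpus contents are hypothetical examples.
corpus = {
    "leave_policy.txt": "Employees accrue 20 days of paid leave per year...",
    "expense_policy.txt": "Expenses above 50,000 JPY require manager approval...",
}

def retrieve(question: str, docs: dict, top_k: int = 1) -> list[str]:
    """Rank chunks by crude word overlap with the question (real systems use embeddings)."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

question = "How many days of paid leave do I get?"
context = "\n".join(retrieve(question, corpus))

prompt = (
    "Answer the question using ONLY the context below. "
    "If the answer is not in the context, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)  # this assembled prompt is what actually goes to the LLM
```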

For more on RAG and hallucinations, I typically point folks to references such as the following:

[2311.04235] Can LLMs Follow Simple Rules?

As for quality-control approaches for LLM-based apps and systems, Weights & Biases has put out several helpful resources:

How to Evaluate, Compare, and Optimize LLM Systems

Evaluating Large Language Models (LLMs) with Eleuther AI

I find that students also benefit from a sense of caution about an LLM's adherence to "rules" we give them in our prompts. While our requests may be formatted in absolute terms, "Never use this word" or "You must use this pattern in every paragraph", there is no guarantee that the response will adhere to such terms. Often the results can be satisfactory or even well above expectations, but LLMs should not be confused with something like a command line, where variables and functions can be strictly defined. The following article gives an interesting deep dive into the ability of LLMs to adhere to rules set out in the prompts we provide them: Can LLMs Follow Simple Rules?
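In practice, I encourage folks to verify rule adherence themselves rather than trusting the prompt. A minimal sketch, with a hypothetical banned-word rule:

```python
# A minimal post-hoc check: since there is no guarantee the model obeyed a
# prompt rule like "never use this word", verify the output before using it.
# The banned words and sample output below are hypothetical.
import re

banned_words = ["synergy", "leverage"]

def violated_rules(output: str) -> list[str]:
    """Return the banned words that actually appear in the model output."""
    return [
        w for w in banned_words
        if re.search(rf"\b{re.escape(w)}\b", output, re.IGNORECASE)
    ]

model_output = "We can leverage our existing tooling to ship faster."
print(violated_rules(model_output))  # ['leverage'] -> revise or re-prompt
```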

How restrictive are the guard rails on some of these AIs?

They can be overly restrictive while at the same time failing to prevent harmful uses. As folks have pointed out more and more biases, likely stemming from poor training-data curation and testing, developers have resorted to quick-and-dirty means of avoiding embarrassing outputs, e.g. simple filtering: this is applied both to the prompt itself and again once the output has been generated. We have yet to see any sophisticated “constructive” guardrails, i.e. those that help the user understand what kind of responses are problematic, how to double-check them, and how to maintain quality overall.

More recent methods have tried to add a kind of "intermediate" filtering, wherein the actual internal layers of the LLM are sampled for problematic features. If such features are detected, the overall app serving the LLM is blocked from producing any output, whether or not the output itself explicitly contains any of the blacklisted words.
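For intuition, here is a deliberately simple sketch of the filtering style of guardrail (one check on the prompt, one on the response); the blocklist and the call_llm stand-in are hypothetical, not any vendor's actual implementation:

```python
# A simple input/output filter around a model call. Real guardrails use
# classifiers and policy models rather than keyword lists; this only shows
# where the two filtering steps sit in the request flow.
BLOCKLIST = {"credit card number", "social security number"}

def is_blocked(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def guarded_call(prompt: str, call_llm) -> str:
    if is_blocked(prompt):
        return "[request refused by input filter]"
    response = call_llm(prompt)  # call_llm is a stand-in for your model client
    if is_blocked(response):
        return "[response withheld by output filter]"
    return response

print(guarded_call("Summarise this memo for me.", call_llm=lambda p: "Here is a summary..."))
```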

Building Guardrails for Large Language Models

What's the state of high-privacy LLM tools and implementations?

This is still a major challenge with LLM applications: currently, the most powerful models are closed-source, and require one to send requests via an API or web-interface to the developer’s external service. LLMs themselves are not inherently insecure, in the sense that just running “inference” (i.e. running an input through a trained model) does not require any data to actually be saved by the model. Sort of like running current through your house’s wiring doesn’t change anything about the wiring itself.

The trouble lies in how the LLM service provider’s overall web app and servers handle your data.

If you are using such a provider’s tools to “fine-tune”, or to create “agents”, “assistants”, etc., the situation is completely different. This is because, to create such customised or fine-tuned models, your data is being used to train the model. To what extent your training data is extractable from the assistant is still an active area of research. We have seen instances of “jail-breaking” (engineering prompts in specific ways so as to bypass safety filters, or to extract snippets of the original training data) wherein security researchers are able to cause such agents to “leak” sensitive data. Note: this could in principle happen whether you fine-tune using a third-party platform or on your own.

As for training a model securely, the idea of “Federated Learning” has come up now and then: basically using a network of training nodes to collaboratively train a model; the data itself however never leaves its respective node.
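For intuition, here is a back-of-the-envelope sketch of the federated-averaging idea using a tiny linear model; real deployments would use a framework such as Flower or TensorFlow Federated, and this toy version is only meant to show that nodes share weights, never data:

```python
# Toy federated averaging: each node takes a gradient step on its own private
# data, and the coordinator averages only the resulting weights.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One gradient-descent step on this node's private data (linear regression)."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Two nodes, each with data that never leaves the node
nodes = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(2)]
global_w = np.zeros(3)

for _ in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in nodes]
    global_w = np.mean(local_ws, axis=0)  # only the weights are aggregated

print(global_w)
```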

Let me also share some resources on the data privacy of LLMs, on using LLMs themselves to mask sensitive data, and, finally, a tool I’ve discovered recently that can be used to generate synthetic data mimicking sensitive examples.

Ethics vs. benchmarks for performance: testing as tools evolve, etc.

The ethical concerns surrounding GenAI make the question of benchmarks all the more important! Bias, security, and quality control in general are very difficult to measure systematically without good benchmarks. My personal take is a tad different from the Faculty’s: while I understand they have their own personal experiences with OpenAI/ChatGPT, I don’t think an individual’s personal experiences with one tool or another are enough to make organisation-level decisions. On the other hand, I agree with them that a single “Leaderboard”-style general accuracy benchmark is not sufficient for such decisions: does a marginal % difference in accuracy score really mean one model is better for your specific needs than another? I recommend looking at a suite of benchmarks designed around specific domain applications, and even considering developing your own internal benchmarks and red-teaming. Let me share some examples of community-developed metrics for various domains. For more, have a look at the “Metrics” section of HuggingFace.co. Most of these are rather rudimentary statistical measures, but you will also note a good number designed for evaluating language models on various tasks, from “text simplification” and “information retrieval” to performance on standardized tests.
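As a starting point, here is a minimal sketch of computing one such task-specific metric on your own examples with Hugging Face's evaluate library (pip install evaluate rouge_score); the predictions and references below are placeholders for your internal test set:

```python
# Compute a summarization-style metric (ROUGE) on a couple of placeholder
# examples, instead of relying on a single leaderboard score.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["The contract renews automatically each year."]
references = ["The agreement is renewed automatically on an annual basis."]

print(rouge.compute(predictions=predictions, references=references))
```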

Example references

How have GenAI tools changed in the past 6 months?

When I first LF’d for the course back in September 2023, I felt the community was somewhat dismissive of things like:

  • Potential of smaller LLMs
  • Self Finetuning
  • In-house (e.g. open source) model implementations.

These approaches have been gaining more momentum recently. A lot more research has appeared in this time showing applications where smaller models might excel (constrained use-cases). Novel “multi-agent” architectures have also gained popularity, i.e. using multiple LLMs to parse tasks. Calls for security and transparency of training data have also grown louder, quieting some of the folks advocating for ChatGPT/OpenAI as a one-stop-shopping solution. Open-source models have also made great strides in this time, most notably Meta’s “Llama”. At the same time, tools and instructional resources allowing folks to do their own fine-tuning have become more common.

Example References

What's one major and direct worry about the current course of LLM evolution/propagation?

I think the flood of generated content onto the internet would be a key issue to think about here: not just how it’s being generated, but how it affects the average user’s perception of digital content. This preprint dives into that issue as it affects visual art on platforms like “Pixiv”: Understanding the Impact of AI Generated Content on Social Media: The Pixiv Case. Please also consider the challenge that some researchers are putting forward, wherein models become harder to train because the fraction of easily-scraped, genuine human-created examples on the internet may decrease (see “Model Collapse”: Model Collapse Demystified: The Case of Regression).

How about those small models?

This How-To by Intel gives a decent, objective breakdown of model performance (in terms of hallucination rate) by parameter size (do smaller models hallucinate more?). You’ll find tons of other blog posts on the subject, but many don’t actually provide hard statistics and/or are likely just advocating for their own offerings (granted, Intel is not exactly a “neutral observer”!). For more objective comparisons, you might consider taking a dive into arXiv; just keep in mind that many papers there have not necessarily been peer-reviewed yet. Still, it can be a great source for understanding the latest R&D trends. As mentioned in the Office Hours, HuggingChat is a great platform for quickly comparing response quality for models of different sizes (HuggingFace.co/chat). Aside from the debate of large vs. small, I’ve found this novel approach gaining attention: Breakthrough: Zero-Weight LLM for Accurate Predictions and High-Performance Clustering - Machine Learning Techniques (the “zero-weight” description may be a bit of a misnomer).

Which use-cases do we see most in organisations?

Chatbots are the most visible ones! These can be consumer-facing or for internal applications, e.g. HR-policy search bots and data-interaction tools. Also have a look at sentiment analysis. The biggest potential I’ve personally noted, however, is in the way individual contributors empower their workflows with GenAI tools via on-the-job learning and skills refining. FYI, I found the following interview with the founder of “Writer” to be quite interesting: Leveraging Hugging Face for complex generative AI use cases.

How is building GenAI applications/products different from other SW applications/products?

I would say the biggest difference is in the quality-control steps and the lack of determinism in the outputs: it’s tricky to get LLMs to provide a consistent quality of service (QoS), and to have them adhere to strict rules, so rigorous testing becomes more of a “fuzzy” process. Otherwise, the experimental nature of LLMs in their current state needs to be taken seriously. Developers are good at making these systems appear polished and professional, but we are only just beginning to map out the pitfalls of LLM-tech-at-scale. New model versions are being released much faster than the R&D community is able to assess and stably integrate them, let alone for enterprises to fully comprehend how their customers will be impacted. This is not to scare folks away! Rather, it’s just a reminder not to confuse cutting-edge with well-vetted.
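One pattern I suggest for that “fuzzy” testing is property-based checks repeated over several runs, rather than asserting an exact output. A minimal sketch, with a stubbed-out model client standing in for whatever your app actually calls:

```python
# Run the same prompt several times and check that each response satisfies the
# properties you care about, instead of expecting identical output each time.
def satisfies_properties(response: str) -> bool:
    return (
        len(response) < 500                                     # length budget
        and "as an ai language model" not in response.lower()   # no boilerplate
        and response.strip().endswith(".")                      # looks like finished prose
    )

def fuzzy_test(prompt: str, call_llm, n_trials: int = 5) -> float:
    """Return the fraction of trials whose output passes the property checks."""
    passes = sum(satisfies_properties(call_llm(prompt)) for _ in range(n_trials))
    return passes / n_trials

# Example with a stubbed model client
pass_rate = fuzzy_test(
    "Summarise our leave policy.",
    call_llm=lambda p: "Employees accrue 20 days of paid leave per year.",
)
print(f"pass rate: {pass_rate:.0%}")  # decide on a threshold, e.g. require >= 80%
```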

Which roles and capabilities are needed in an LLM team?

Of course new skills like “Prompt Engineering” are critical; not necessarily in the sense that you need to hire a “veteran”, but that your team needs to be accumulating knowledge and experimental results on the best ways to write prompts. But more than that, not overlooking the traditionally required skills is critical! E.g. folks skilled in software testing, scalable cloud architectures, and system-redundancy engineering (i.e. being able to swap in alternate models and providers in case a major product or API becomes unavailable; especially true for GenAI, as the field is evolving so quickly and we are still very much in the “hype bubble”). I highly recommend following Eduardo Ordax (GenAI lead at AWS) and his thoughts on MLOps: basically, the hype of GenAI has distracted folks from MLOps and DevOps, but scaling GenAI is going to require us to get back to those.