News

AI Can Now Understand Your Videos by Watching Them

Labeling objects is easy for humans, but difficult for computers.

  • Researchers say they can teach AI to label videos by watching and listening.
  • The AI system learns to represent data to capture concepts shared between visual and audio data.
  • It’s part of an effort to teach AI to understand concepts that humans can easily learn but that computers find hard to grasp.

Artificial intelligence robot touching futuristic data screen.

Yuichiro Chino / Getty Images

A new artificial intelligence (AI) system could watch and listen to your videos and label the things that are happening in them.

MIT researchers have developed a technique that teaches AI to capture actions shared between video and audio. For example, their method can understand that the act of a baby crying in a video is related to the spoken word “crying” in a sound clip. It’s part of an effort to teach AI to understand concepts that humans can easily learn but that computers find hard to grasp.

“The prevalent learning paradigm, supervised learning, works well when you have datasets that are well described and complete,” AI expert Phil Winder told Lifewire in an email interview. “Unfortunately, datasets are rarely complete because the real world has a bad habit of presenting new situations.”

Smarter AI

Computers have difficulty figuring out everyday scenarios because, unlike humans, they must crunch data rather than sounds and images. When a machine “sees” a photo, it must encode that photo into data it can use to perform a task such as image classification. AI can get bogged down when inputs come in multiple formats, like videos, audio clips, and images.

“The main challenge here is, how can a machine align those different modalities? As humans, this is easy for us,” Alexander Liu, an MIT researcher and first author of a paper on the subject, said in a news release. “We see a car and then hear the sound of a car driving by, and we know these are the same thing. But for machine learning, it is not that straightforward.”
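To make the alignment problem concrete, the rough sketch below shows how differently each modality arrives before any encoding happens; the array shapes are illustrative assumptions, not details from the MIT paper. An image, an audio waveform, and a video clip have nothing structurally in common until an encoder maps each one into a shared representation.

```python
# Illustrative only: assumed shapes, not taken from the MIT paper.
import numpy as np

image = np.zeros((224, 224, 3))          # one photo: height x width x RGB channels
audio = np.zeros((16000 * 3,))           # 3 seconds of 16 kHz audio: a 1-D waveform
video = np.zeros((30, 224, 224, 3))      # 30 frames: time x height x width x RGB

# Before encoding, the modalities share no common structure a model can compare.
for name, array in [("image", image), ("audio", audio), ("video", video)]:
    print(name, array.shape)
```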

Liu’s team developed an AI technique that they say learns to represent data in a way that captures concepts shared between visual and audio data. Using this knowledge, their machine-learning model can identify where a specific action is taking place in a video and label it.

The new model takes raw data, such as videos and their corresponding text captions, and encodes them by extracting features, or observations about objects and actions in the video. It then maps those data points onto a grid known as an embedding space. The model clusters similar data together as single points on the grid, and each of these data points, or vectors, is represented by an individual word.

For example, a video clip of a person juggling might be mapped to a vector labeled “juggling.”

The researchers designed the model so it can use only 1,000 words to label the vectors. The model can decide which actions or concepts it wants to encode into a single vector, but it can use only 1,000 vectors. The model chooses the words it thinks best represent the data.

“If there is a video about pigs, the model might assign the word ‘pig’ to one of the 1,000 vectors. Then, if the model hears someone saying the word ‘pig’ in an audio clip, it should still use the same vector to encode that,” Liu explained.
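The snippet below is a minimal sketch of that idea, not the authors’ code: separate encoders turn a video clip and an audio clip into vectors in one shared embedding space, and each vector is then snapped to the nearest entry in a codebook of 1,000 word-labeled vectors. The encoder functions, the 64-dimensional embedding size, and the placeholder vocabulary are all assumptions for illustration; in a trained model, paired video and audio of the same event would be pulled toward the same codebook entry.

```python
# Minimal sketch of a shared, word-labeled codebook across modalities.
# Names, sizes, and the vocabulary are hypothetical, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64
VOCAB = ["pig", "juggling", "crying"] + [f"word_{i}" for i in range(997)]  # 1,000 labels
codebook = rng.normal(size=(len(VOCAB), EMBED_DIM))  # one learned vector per word

def encode_video(frame_features: np.ndarray) -> np.ndarray:
    """Stand-in for a learned video encoder: pool per-frame features into one vector."""
    return frame_features.mean(axis=0)

def encode_audio(step_features: np.ndarray) -> np.ndarray:
    """Stand-in for a learned audio encoder: pool per-timestep features into one vector."""
    return step_features.mean(axis=0)

def quantize(embedding: np.ndarray) -> str:
    """Snap an embedding to its nearest codebook vector and return that vector's word."""
    distances = np.linalg.norm(codebook - embedding, axis=1)
    return VOCAB[int(np.argmin(distances))]

# Fake paired inputs: 30 video frames and 100 audio steps of 64-dimensional features.
video_clip = rng.normal(size=(30, EMBED_DIM))
audio_clip = rng.normal(size=(100, EMBED_DIM))

# With trained encoders, a pig video and the spoken word "pig" would land on the
# same codebook entry; with these random stand-ins, the labels are arbitrary.
print(quantize(encode_video(video_clip)), quantize(encode_audio(audio_clip)))
```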

Your Videos, Decoded

Marian Beszedes, head of research and development at biometrics company Innovatrics, told Lifewire in an email interview that better labeling systems, such as the one developed at MIT, could help reduce bias in AI. Beszedes suggested that the data industry can view AI systems from a manufacturing-process perspective.

“The systems accept raw data as input (raw materials), preprocess it, ingest it, make decisions or predictions and output analytics (finished goods),” Beszedes said. “We call this process flow the ‘data factory,’ and like other manufacturing processes, it should be subject to quality controls. The data industry needs to treat AI bias as a quality problem.

“From a consumer perspective, mislabeled data makes e.g. online search for specific images/videos more difficult,” Beszedes added. “With correctly developed AI, you can do labeling automatically, much faster and more neutral than with manual labeling.”

An MIT AI model that identifies and labels where specific actions occur in a video.

MIT News

However, the MIT model still has some limitations. For one, their research focused on data from two sources at a time, but in the real world, humans encounter many types of information simultaneously, Liu said.

“And we know 1,000 words work on this kind of dataset, but we don’t know if it can be generalized to a real-world problem,” Liu added.

The MIT researchers say their new technique outperforms many similar models. If AI can be trained to understand videos, you may eventually be able to skip watching your friend’s vacation videos and get a computer-generated report instead.




